#Introduction to Open Data Science - Course Project
Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 12 11:00:33 2022"
The text continues here.
Open the file chapter1.Rmd located in your IODS-project folder with RStudio. Just write some of your thoughts about this course freely in the file, e.g.,
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 12 11:00:33 2022"
1. How are you feeling right now?
I am excited about the course. It is a bit intimidating, since it seems that there is quite a lot of work. But, I cant wait to be able to apply what I will learn here to my own research.
2. What do you expect to learn?
I am a PhD student and I found the course content very fitting for my needs. I am the most excited to learn more about GitHub and R Markdown. Also, the classes regarding model validation (2), clustering/classification (4), and dimensionality reduction techniques (5) are very interesting topics for me, since I am doing my PhD on psychometric validation of (Short) Warwick-Edinburgh Mental Well-being Scale, (S)WEMWBS, among Finnish population.
3. Where did you hear about the course?
I remember taking one of the Kimmo’s course almost 10 years ago, when I was an undergraduate. Even though the course at the time was held in very early morning, I really enjoyed his class. I live in Austarlia, and When I noticed (in Sisu) that he is holding an online course again, I signed up immediately.
Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1:
4. How did it work as a “crash course” on modern R tools and using RStudio?
I have used RStudio before, so I am familiar with the program and I had everything already installed. However, this is the first time when I will be using R Markdown and GitHub.
I also have another statistics course at the moment where will be using R Markdown, so I am excited to learn the syntax and get familiar with the program, along with GitHub, to see how I can use it in my own research. GitHub for example, could work really well, when I have multiple different scripts that I will be testing/editing. I also really like the layout of R Markdown, it is so much easier to follow when you knit it, than normal R script. The interactive features are amazing and can be really cool to add as supplementary material into your manuscript, so people could view different scenarios and examine the topic a bit more deeper.
Also, the R Markdown Tutorial was very helpful. R Markdown Tutorial
5. Which were your favorite topics?
I really like the layout of R Markdown and the Cheat Sheet Cheat Sheet
Also, I found that the R for Health Data Science book very helpful and I know that I will be using it a lot in the future. It seems to have very illustrative examples and code that I can adapt to my own research. I think it will be much more useful when later on we have actual exercises when we need to write our own code. I tend to use Stack Overflow, general Googling, and other peoples code as dictionary or grammar book, when I need to solve some issues with my code. In my opinion you learn the best when you are simultaneously trying to apply the piece of code to solve a problem. Just reading/viewing it is also helpful, but it is hard to grasp all the information at once without a specific task you try to solve.
6. Which topics were most difficult?
I think I have okay understanding of R and RStudio. I know how to “read” and “edit” most of the code, intall and use new packages, etc. The difficult part is when you have an idea what you want to do, and you try to find the best way to edit the code (for example, getting certain colours, divide data based on stratas etc.). Sometimes the packages have different syntax than the “normal” R code, even the syntax in R Markdown is different (e.g., how to mark comments).
However, I found the example code in Exercise1.Rmd very helpful to get started. I would prefer if R for Health Data Science would also have a PDF version, since I prefer to have a copy saved on my personal laptop, so I could highlight and add comments to the text. Also, if I understood it correctly, the book is based on around using the tidyverse-package, since pipe %>% is a part of this package, and would not work if you don’t have tidyverse() installed. There are many ways to write the R code by using different packages and some are using the basic R code and some their own, and sometimes they are mixed. Having a tutorial that would help to understand which syntax you need/can use would be very beneficial.
However, I have not used the GitHub before, so I found it quite difficult to get it started and understand the layout and what things are saved to my personal computer/files and which are online. “Committing” and “Pushing” things to GitHub seemed also quite hard at the start.
I also find it challenging to learn/understand the YAML at the start of R Markdown script, and how to edit them
For example, my index.Rmd code did not run at the start and trying to find the ways to fix it was difficult. In the end it just worked even though I did not change anything - I think it was trying to knit the script into something else than html.
Also add in the file a link to your GitHub repository (that you created earlier): https://github.com/your_github_username/IODS-project
Remember to save your chapter1.Rmd file.
Open the index.Rmd file with RStudio.
At the beginning of the file, in the YAML options below the ‘title’ option, add the following option: author: “Your Name”. Save the file and “knit” the document (there’s a button for that) as an HTML page. This will also update the index.html file.
index.Rmd error code
Error in yaml::yaml.load(…, eval.expr =
TRUE) : Parser error: while parsing a block mapping at line 1, column 1
did not find expected key at line 3, column 3 Calls:
To make the connection between RStudio and GitHub as smooth as possible, you should create a Personal Access Token (PAT).
The shortest way to proceed is to follow the steps below. (Source: https://happygitwithr.com/https-pat.html)
Execute the R commands (preceded by ‘>’) in the RStudio Console (below the Editor):
> install.packages(“usethis”) > usethis::create_github_token()
GitHub website will open in your browser. Log in with your GitHub credentials.
Return to RStudio and continue in the Console:
> gitcreds::gitcreds_set()
Apparently, I already had PAT, but I decided to update it, so I could finish this assignment. Now you should be able to work with GitHub, i.e., push and pull from RStudio.
Upload the changes to GitHub (the version control platform) from RStudio. There are a few phases (don’t worry: all this will become an easy routine for you very soon!):
Note: It is useful to make commits often and even on
small changes.
Commits are at the heart of the version control system, as a single
commit represents a single version of the file.)
After a few moments, go to your GitHub repository at https://github.com/your_github_username/IODS-project to see what has changed (please be patient and refresh the page).
Also visit your course diary that has been automatically been updated at https://your_github_username.github.io/IODS-project and make sure you see the changes there as well.
After completing the tasks above you are ready to submit your
Assignment for the review (using the Moodle Workshop below).
Have the two links (your GitHub repository and your course
diary) ready!
Remember to get back there when the Review phase begins (see course
schedule).
| End of Assignment 1: Tasks and Instructions |
| *** |
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 11:00:34 2022"
TASK INSTRUCTIONS: Create a folder named ‘data’ in your IODS-project folder. Then create a new R script with RStudio. Write your name, date and a one sentence file description as a comment on the top of the script file. Save the script for example as create_learning2014.R in the data folder. Complete the rest of the steps in that script.
Figure demonstrates how to create a new folder.
Please see create_learning2014.R and lrn14_KS.csv to evaluate the Data wrangling from my a GitHub repository: https://github.com/kiirasar/IODS-project you can find the files in data folder.
First we install/use R packages we need to complete the assignment.
# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):
# install.packages("GGally")
#install.packages("GGally")
#install.packages("tidyverse")
#install.packages('readr')
#install.packages('ggplot2')
#install.packages("psych")
#install.packages("vtable")
library(vtable)
## Warning: package 'vtable' was built under R version 4.2.2
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.2.2
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## Warning: package 'readr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(ggplot2)
TASK INSTRUCTIONS: Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url.
Explore the structure and the
dimensions of the data and describe the dataset
briefly, assuming the reader has no previous knowledge of it.
Information related to data can be found
here
# Read the data from your local drive using setwd()-command
# setwd('C:\\Users\\Kiira\\Documents\\PhD_SWEMWBS\\PhD Courses\\Courses in 2022\\PHD-302 Open Data Science\\IODS-project')
# lrn14 <- read_csv("data/lrn14_KS.csv")
# head(lrn14) #gender, age, A_att, A_deep, A_stra, A_surf, points
# View(lrn14)
# or from url
std14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=T) # sep=separator is a comma, header=T
head(std14) #gender, age, attitude, deep, stra, surf, points
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
View(std14)
head() command is used to show the first 6 lines of the datase, whereas View() opens the whole dataset into a new tab.
NOTE. data from local drive is
named as lrn14 and data from url as
std14.
Read_csv-command worked on R before, but I could not knit the
document for some reason. This is why its only there as comments and I
decide use the url (std14) dataset. The data is exact same, only the
variable names are different. I will use the url data to complete the
assignment.
dim(std14)
## [1] 166 7
dim() is R function to explore the dimension of the dataset. The dataset has 166 rows (observations) and 7 columns (variables).You can read the name of the variables or have better look at the data by using head(std14) and View(std14)
str(std14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
str() is R function to explore the structure of the dataset. The dataframe has 166 observations and 7 variables, like in dim().
TASK INSTRUCTIONS: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
SUMMARY STATISTICS:
To explore the summaries of each variable I used
vtable-package and st()-command, also
know as sumtable().
Here is link
where you can find more information regarding vtable
and st()-command
st(std14) # the command prints a summary statistics table to Viewer-window
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| gender | 166 | ||||||
| … F | 110 | 66.3% | |||||
| … M | 56 | 33.7% | |||||
| age | 166 | 25.512 | 7.766 | 17 | 21 | 27 | 55 |
| attitude | 166 | 3.143 | 0.73 | 1.4 | 2.6 | 3.7 | 5 |
| deep | 166 | 3.68 | 0.554 | 1.583 | 3.333 | 4.083 | 4.917 |
| stra | 166 | 3.121 | 0.772 | 1.25 | 2.625 | 3.625 | 5 |
| surf | 166 | 2.787 | 0.529 | 1.583 | 2.417 | 3.167 | 4.333 |
| points | 166 | 22.717 | 5.895 | 7 | 19 | 27.75 | 33 |
Dataset std14 has a total of 166 observations (participants) and 7 variables (gender, age, attitude, deep, stra, surf and points). In the dataset:
NOTE. The different learning methods (deep, stra, surf) are average based on several items for each learning method. The summary display the basic descriptive statistics: mean, standard deviation, minim, lower and upper quartiles (25% and 75%) and maximum. The scale among learning techniques are 1-5.
Points denotes the students exam points in a statistics course exam.
BARPLOTS - Nominal variables:
I used ggplot-package and barplot to explore the distributions and counts based of gender (nominal)
# ggplot()=command, std14=dataframe, eas(x=variable) + type of plot
gg_gender <- ggplot(std14, aes(x=gender)) + geom_bar() #barplot for nominal variables.
gg_gender
# you can make the plots looking prettier by adding extra code:
ggplot(std14, aes(x=as.factor(gender), fill=as.factor(gender) )) +
geom_bar(aes(fill=gender)) +
geom_text(stat='count',aes(label=..count..),vjust=-0.3) + #Adding counts on top of the bars
labs(x = "", fill = "gender") + #filling bars based on gender
ggtitle("Barplot based on gender Learning 2014 dataset") + #adding title
ylab("count")+ xlab("gender") + #adding x and y labels
scale_x_discrete(labels=c("F" = "Female", "M" = "Male")) #changing F into female and M into male
According to the previous summary table and barplot dataset std14 has 110 female and 56 male participants.
HISTOGRAMS - Continuous variables:
I made histograms for every continuous variable: age, attitude, deep, stra, surf, and points, in order to check if these are normally distributed - meaning that the distribution follows the bell curve. If variables are not normally distributed, we can’t use parametric statistical approaches e.g., general regression models, but rather non-parametric statistical methods.
NOTE. When making plots, it is important to include everyone. Some people might have difficulties see all the colours e.g., colour blind, so it is imporant to use right colours. On this website you can find inclusive colour pallets.
The #CODE are referring to certain colours.
Also, I wanted to print all the plot in one page by using multiplot()-command which is part of ggplot-package. Before using the command I needed to run a code that can be found here
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
require(grid)
plots <- c(list(...), plotlist)
numPlots = length(plots)
if (is.null(layout)) {
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots == 1) {
print(plots[[1]])
} else {
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
for (i in 1:numPlots) {
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
Then I created a histogram for each continuous variable and used the colours from
# NOTE. you can also use "=" instead of "<-" to create objects. However, ofter "<-" is better, since some packages might use "=" for something else.
p1=ggplot(std14) +
geom_histogram(aes(x = age), fill = "#E69F00") +
labs(title="Age")
p2=ggplot(std14) +
geom_histogram(aes(x = attitude), fill = "#56B4E9") +
labs(title="Attitude")
p3=ggplot(std14) +
geom_histogram(aes(x = deep), fill = "#009E73")+
labs(title="Deep learning")
p4=ggplot(std14) +
geom_histogram(aes(x = stra), fill = "#F0E442")+
labs(title="Strategic learning")
p5=ggplot(std14) +
geom_histogram(aes(x = surf), fill = "#0072B2")+
labs(title="Surface learning")
p6=ggplot(std14) +
geom_histogram(aes(x = points), fill = "#D55E00")+
labs(title="Points")
Last, I ran the multiplot()-command.
multiplot(p1, p2, p3, p4, p5, p6, cols=3) #prints 3 columns
## Loading required package: grid
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Summary regarding the results:
Next, I wanted to create a histogram where I added all the learning strategies on top of each other.
# NOTE. alpha=.5, makes the colours trasparent 50%.
ggplot(std14) +
geom_histogram(aes(x = deep), fill = "#009E73", alpha=.5) + # green
geom_histogram(aes(x = stra), fill = "#F0E442", alpha=.5) + # yellow
geom_histogram(aes(x = surf), fill = "#0072B2", alpha=.5) + # blue
labs(title="Learnign strategies", x="Learning strategies (Mean)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Summary regarding the results:
RELATIONSHIP between the variables:
ggpairs()-comman is part of ggplot and it creates more advanced plot matrix where you can explore the relationships between the variables.
ggpairs(std14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
#p <- ggpairs(learning2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# This draws two distributions etc for both male and female participants separately.
The command prints out:
Histograms (first 2 columns on left) based on gender (female, male)
Boxplots (first row) based on gender: female (top), male (bottom)
Normal distributions (diagonal) only for continuous variables
Correlations (up diagonal) only for continuous variables
Scatterplots - Relatinships between continuous variables
However, the figure is quite small, so it is easier to explore the scatterplots by using pairs() command.
# this piece of code excludes the gender (nominal variable)
pairs(std14[-1])
But even that is quite ugly. Also, the plots below and above the diagonial line are identical (just opposite scaling). To make the scatterplots nicer, we can create nicer scatterplots with ggplot.
Age scatterplots
# Age
sp1 <- ggplot(std14, aes(x = age, y = attitude)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and attitude")
sp2 <- ggplot(std14, aes(x = age, y = deep)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and deep learning")
sp3 <- ggplot(std14, aes(x = age, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and strategic learning")
sp4 <- ggplot(std14, aes(x = age, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and surface learning")
sp5 <- ggplot(std14, aes(x = age, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: age and points")
multiplot(sp1, sp2, sp3, sp4, sp5, cols=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# age and attitude, r=0.022
# age and deep learning, r=0.025
# age and strategic learning, r=0.102
# age and surface learning, r=-0.141
# age and points, r=-0.093
Since age is very skewed the correlation between other variables are very low. Meaning the scatterplots and regression line is very flat, indicating low or non-correlation. Age does not seem to be related to different learning techniques, attitudes or overall exam points. However, age was also very skewed meaning a lot of people were same age, that may affect the results.
Attitude scatterplots
## Attitude
sp6 <- ggplot(std14, aes(x = attitude, y = deep)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and deep learning")
sp7 <- ggplot(std14, aes(x = attitude, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and strategic learning")
sp8 <- ggplot(std14, aes(x = attitude, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and surface learning")
sp9 <- ggplot(std14, aes(x = attitude, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: attitude and points")
multiplot(sp6, sp7, sp8, sp9, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#attitude and deep learning, r=0.110
#attitude and strategic learning, r=0.062
#attitude and surface learning, r= -0.176*
#attitude and points, r=0.437***
The first two graphs (on left column) had very small correlation and
the line was very flat.
The relationship between attitude and surface learning (top
right) shows a small significant negative correlation
(r= -0.176), indicating that the line goes down: people with
higher surface learning points, would have higher chance to have lower
attitude points as well.
This could mean that people who use surface learning techniques have
worsen attitude towards learning in general.
Alternatively, the relationship between attitude and points
(down right) show significant positive correlation
(r=0.437); the line goes up. Indicating that individuals with
high attitude points would often also have high overall points - and
vice versa; individual with low attitude would also have low overall
points.
One interpretation of these finding is that people who have good
attitude towards learning will also success better in their exams.
Deep learning scatterplots
# deep
sp10 <- ggplot(std14, aes(x = deep, y = stra)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep and strategic learning")
sp11 <- ggplot(std14, aes(x = deep, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep and surface learning")
sp12 <- ggplot(std14, aes(x = deep, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: deep learning and points")
multiplot(sp10, sp11, sp12, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#deep and strategic learning, r=0.097
#deep and surface learning, r= -0.324***
#deep learning and points, r= -0.010
Only, relationship between deep and surface learning (bottom left) had significant correlation (r= -0.324). The correlation was also negative, meaning that higher deep learning scores were associated with lower surface learning scores and vice versa.
This could mean that people who often use deep learning techniques do rarely use surface learning techiques and vice versa.
Other relationship showed barely any correlation and therefore the line was fairy flat.
Strategic learning
# stra
sp13 <- ggplot(std14, aes(x = stra, y = surf)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: strategic and surface learning")
sp14 <- ggplot(std14, aes(x = stra, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: strategic learning and points")
sp15 <- ggplot(std14, aes(x = surf, y = points)) +
geom_point() + #scatterplot
geom_smooth(method = "lm") + #regression line
labs(title="Scatterplot: surface learning and points")
multiplot(sp13, sp14, sp15, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#strategic and surface learning, r= -0.161*
#strategic learning and points, r=0.146
#surface learning and points, r= -0.144
Only the association between strategic and surface
learning showed significant correlation, which was negative
(r=-0.161), meaning that lower points were associated with
higher surface learning points.
This could mean that people who only use surface learning techniques
will struggle to grasp more deeper understanding of different concepts
that could lead lower exam points.
Below there is a code and figure with all the scatterplots by using multiplot(), but yet again, the graphs are too small, so the interpretation of the findings is difficult.
multiplot(sp6, sp7, sp8, sp9, sp10, sp11, sp12, sp13, sp14, sp15, cols=4) #prints 5 columns
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
TASK INSTRUCTIONS: Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
I choose these three independent variables, since they are different strategies how people learn.
my_model3 <- lm(points ~ deep + stra + surf, data = std14)
# my_model3 #call the linear model, intercept and slopes.
summary(my_model3) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
Based on the model only strategic and surface learning are statistical significant (. = 0.10), but deep learning is not.
However, the model explain only 4-2.3% of the exam results (Multiple R-squared = 0.04071 and Adjusted R-square = 0.02295). Also the overall p-value of the whole model is relatively bad 0.08016.
Overall, it seems that different learning techniques will pay either none or only little role in explaining overall exam points.
Since, deep learning is not statistically significant, I will remove it from the model and fit the model again without it.
NOTE. Overall, p<.01 is not very good result, normally p<.05 is the level of statistical significant results at least in my research field (psychology).
my_model2 <- lm(points ~ stra + surf, data = std14) #exclude deep learning from the model
summary(my_model2) #summary of the model
##
## Call:
## lm(formula = points ~ stra + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4574 -3.2820 0.4296 4.0737 9.8147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.5635 3.3104 7.118 3.31e-11 ***
## stra 0.9635 0.5950 1.619 0.107
## surf -1.3828 0.8684 -1.592 0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.822 on 163 degrees of freedom
## Multiple R-squared: 0.03634, Adjusted R-squared: 0.02452
## F-statistic: 3.074 on 2 and 163 DF, p-value: 0.04895
When excluding the “deep learning” from the model, neither strategic or surface learning remain significant.
As additional task, I wanted to try a completely new model, where attitude, deep and surface learning could try to explain the exam points. I chose these, since they had the highest correlations. However, this might cause multi-collienarity that can impact on our results.
my_model32 <- lm(points ~ attitude + deep + surf, data = std14)
summary(my_model32) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ attitude + deep + surf, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
Interestingly, neither deep or surface learning were significant independent variables to predict/explain exam points. However, attitude was highly significant (p<.0001). The results might be caused by the high positive association/correlation between attitude and exam points.
Overall, it seem that attitude plays much more bigger role explaining exam results than learning techniques.
The new model is also much better than the previous (model3): R-square was 0.2024, meaning that this model explains 20% variation in exam points. The models p-value was also much better than before: p<.0001
Lastly, I excluded both learning techniques from the model to see if we could increase the model fit.
my_model1 <- lm(points ~ attitude, data = std14)
summary(my_model1) #summary of the model, including the single variable statistical significant summaries
##
## Call:
## lm(formula = points ~ attitude, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
This had a little impact:
TASK INSTRUCTIONS: Produce the following diagnostic plots:
Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.
FURTHER INSTRUCTIONS based on Exerice 2: R makes it
easy to graphically explore the validity of your model assumptions, by
using plot()-command e.g., plot(my_model3).
In the plot() function argument which can help you to
choose which plots you want. We will focus on plots 1,
2 and 5:
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Standardized residuals vs Fitted values |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
Before the call to the plot() function, add the
following: par(mfrow = c(2,2)). This will
place the following 4 graphics to the same plot.
RESIDUALS
In general the following graphs are focusing on exploring the residuals
of the model.
In statistical point of view, residual is the difference between predicted values of y (dependent variable) and observed values of y . So. Residual = actual y value − predicted y value, (ri=yi−^yi).
Another example explaining residuals is the distance from the linear line. If the observations is located above the linear line, residual is positive and if the observation is located below the line, it is negative.
If our model would explain 100% of the variation of dependent variable, residual would be 0, meaning that all the observations would be touching the linear line.
In a way, you could say that residual is a measure of how well a linear line fits an individual data points.
The picture above is a screenshot from khanacademy.org
R-square on the other hand is calculated a correlation coefficient squared. Or, as well as, the sum of squares of residuals (also called the residual sum of squares, SSres) divided by the total sum of squares (proportional to the variance of the data, SStot) minus 1.
One way to illustrare the SSres is to draw squares between linear line and single data points in a way that the square would touch the linear line. The sum of the squares are SSres.
The picture above is a screenshot from Wikipedia.
Lastly, if the model explains only 5% of the variance of chosen dependent variable (outcome, y), it means that the residuals, everything else except the chosen independent variables (x), are explaining the rest 95% of the variance. Meaning, that we were not able to successfully detect the whole phenomena.
In the final assignment, I will use the first model (model3) as example
whereas, deep learning is not significant indicator, but strategic and surface learning variables are, but the R^2 is very low (2%)
par(mfrow = c(2,2))
plot(my_model3, which = c(1,2,5))
Residual vs Fitted
This graph illustrates the residuals non-linear patterns. If the
red line is roughly horizontal then we can assume that the residuals
follow a linear pattern. If the residuals are not linear, we
would need to consider non-parametric statistical approaches to
investigate the relationship between the variables. Sometimes, there
might be relationship between the variables, even though it would not be
linear. This plot helps us to detect any other possible non-linear
relationship (parabolic/quadratic/polynominal, expotential, steps,
etc.)
Based on the graph, in model 3 the line seem to be fairly
horizontal, so we can claim that the residuals are following linear
patter along with the indicators (dependent,
x-variables).
NOTE. This are just “raw” residuals, not
standardized
Normal Q-Q
This graph is illustrating if the residuals of the regression model are
the normal distributed. In a perfect world, the dots would align
with the linear line, indicating that the residuals are in deed normally
distributed. Each data point is presented in this picture
(dot).
In model 3 it looks like that some observations from the begin and end of the data set are not in line with the linear model. However, overall it is roughly following the line, so we can confirm that the residuals are normally distributed.
Residual vs Leverage This graph is mainly used to spot influential data points, aka outliers, or single data points that could have an impact on our model. This graph can also be used to examine heteroskedasticity (different variance based on different independent variables) which can often be caused by an outlier. We can also investigate non-linearity with this graph.
See more Rummerfield & Berman, 2017, page 3
In our exmaple,
-we have some observations that are below -2 (y-axis) e.g., observations 145, 35, and 19 - The the average leverage is 4/166 ≈ 0.024 then any data point beyond 0.0482 or ≈ 0.05 (2 × 0.024) has a high leverage value. In our model 3 there a some obervations passed 0.5 (x-axis), which we could drop out. - However, our we cant even see the Cook’s distance contour line and these outliers are not too far away from the suggested cut-off lines.
Overall, we can conclude that our data does not have any (or only few) influential data points.
End of Assignment 2: Tasks and Instructions
(more chapters to be added similarly as we proceed with the course!)
date()
## [1] "Mon Dec 12 11:01:21 2022"
Please see create_alc.R to evaluate the Data wrangling from my a GitHub repository: https://github.com/kiirasar/IODS-project you can find the files in data folder.
First we install/use R packages we need to complete the assignment.
# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):
# install.packages("GGally")
#install.packages("GGally")
#install.packages("tidyverse")
#install.packages('readr')
#install.packages('ggplot2')
#install.packages("psych")
#install.packages("vtable")
library(vtable)
library(psych)
library(GGally)
library(tidyverse)
library(readr)
library(ggplot2)
library(tidyr)
library(dplyr)
To read the dataset from either my local folder (read_csv()) or from url (reab.table()) use the commands in brackets.
alc_KS <- read_csv("data/create_alc_KS.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(alc_KS)
## # A tibble: 6 × 35
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health servic… home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reput…
## # … with 24 more variables: guardian <chr>, traveltime <dbl>, studytime <dbl>,
## # schoolsup <chr>, famsup <chr>, activities <chr>, nursery <chr>,
## # higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## # goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, failures <dbl>,
## # paid <chr>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, alc_use <dbl>,
## # high_use <lgl>
# or from url
alc_a3 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=T) # sep=separator is a comma, header=T
head(alc_a3) # Shows the first 5 rows of the dataset
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime schoolsup famsup activities nursery higher
## 1 mother 2 2 yes no no yes yes
## 2 father 1 2 no yes no no yes
## 3 mother 1 2 yes no no yes yes
## 4 mother 1 3 no yes yes yes yes
## 5 father 1 2 no yes no yes yes
## 6 mother 1 2 no yes yes yes yes
## internet romantic famrel freetime goout Dalc Walc health failures paid
## 1 no no 4 3 4 1 1 3 0 no
## 2 yes no 5 3 3 1 1 3 0 no
## 3 yes no 4 3 2 2 3 3 2 yes
## 4 yes yes 3 2 2 1 1 5 0 yes
## 5 no no 4 3 2 1 2 5 0 yes
## 6 yes no 5 4 2 1 2 5 0 yes
## absences G1 G2 G3 alc_use high_use
## 1 5 2 8 8 1.0 FALSE
## 2 3 7 8 8 1.0 FALSE
## 3 8 10 10 11 2.5 TRUE
## 4 1 14 14 14 1.0 FALSE
## 5 2 8 12 12 1.5 FALSE
## 6 8 14 14 14 1.5 FALSE
View(alc_a3) # Preview of the whole data set.
# In this assignment I will be using the dataset from url (alc_a3)
print(colnames(alc_a3)) # print the column names
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
glimpse(alc_a3) # have a bit better look at the data
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
Short description of the dataset alc_a3 by using glimpse()-command.
More detailed description of the dataset can be found here and Exercise3.Rmd as well as in Moodle (Assignment 3: Tasks and Instructions)
Variables:
High alcohol consumption (high_use: TRUE, FALSE) and sex (FEMALE, MALE): cross-tabulation and barplot
Creating across-tabulation
t0 <- xtabs(~high_use+sex, data=alc_a3)
ftable(t0) # print table.
## sex F M
## high_use
## FALSE 154 105
## TRUE 41 70
summary(t0) # chi-square test of indepedence.
## Call: xtabs(formula = ~high_use + sex, data = alc_a3)
## Number of cases in table: 370
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 15.812, df = 1, p-value = 6.996e-05
# or
# http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t0_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "sex", type = "f")
t0_f #frequency count
## sex F M Sum
## high_use
## FALSE 154 105 259
## TRUE 41 70 111
## Sum 195 175 370
t0_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "sex", type = "c")
t0_c # %column
## sex F M
## high_use
## FALSE 78.97 60.00
## TRUE 21.03 40.00
## Sum 100.00 100.00
Based on ftable(t0) and t0_f we can see that among females 154 drink below high use (FALSE) and 41 have high use (TRUE). Among males the frequensies are 105 and 74, respectively.
Based on the t0_c it is easier to establish the difference between sex. Male seem indeed use more alcohol (high use 40%) than females (high use 21%).
Based in the summary(0) you can also see that this difference is significant p<.001, x^2(1)=15.81
COLOUR PALETTES
Next we wil create a bar plot based on the results. To do so, we need to creat a new dataframe. Tips can be found here.
When making plots, it is important to include everyone. Some people might have difficulties see all the colours e.g., colour blind, so it is imporant to use right colours. On this website you can find inclusive colour pallets.
group_col=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2") #saving colour-blind "safe" colours
Creating a dataframe
# data frame based on frequencies
df = data.frame(group=c("Female Low", "Male Low", "Female High", "Male High"),
value=c(154, 105, 41, 70))
df
## group value
## 1 Female Low 154
## 2 Male Low 105
## 3 Female High 41
## 4 Male High 70
# data frame based on column %
df_p = data.frame(group=c("Female Low", "Male Low", "Female High", "Male High"),
percentage=c(79, 60, 21, 40))
df_p
## group percentage
## 1 Female Low 79
## 2 Male Low 60
## 3 Female High 21
## 4 Male High 40
Creating a plot - sex and alcohol consumption (high_use)
#based on frequency
ggplot(df, aes(x=group, y=value, fill=group)) + #basic plot
geom_bar(stat="identity") + #define a plot and put groups are side-by-side
geom_text(aes(label=value), vjust=-0.3, size=3.5) + #add frequencies
scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours
#based on column%
ggplot(df_p, aes(x=group, y=percentage, fill=group)) + #basic plot
geom_bar(stat="identity") + #define a plot and put groups are side-by-side
geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours
Based on the plot you can also easily to see that based on the column% female participants consumed less alcohol (low=79%, high=21%) in comparison to male students (low=60%, high=40%). This result support our previous hypothesis: male has higher alcohol consumption than females. In other words, being a mae might be arisk factor when considering alcohol consumption among students.
High alcohol consumption (high_use) and absences: summary and boxplot
Use tapply()-command to search the basic summary of absences-variable.
tapply(alc_a3$absences, alc_a3$high_use, summary)
## $`FALSE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 3.00 3.71 5.00 45.00
##
## $`TRUE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 6.378 9.000 44.000
Absence seem to be higher among students with high acohol consumption habits: Mean=6.4, Mdn=4.0, similarly both the quartilies (1 and 3) and maximum were higher than for those who had low alcohol consumption habits. However, tapply() command does not test the statistical significancy, nor give SD values or confidential intervals (C.I.), so it is hard to make conclusion that this difference would be statistically significant.
Next, we initialize a boxplot of high_use and absences
g1 <- ggplot(alc_a3, aes(x = high_use, y = absences, fill=high_use)) + #alcohol consumption with absences (numeric)
geom_boxplot() + ylab("Absences") + xlab("High alcohol use (more than 2)") +
scale_fill_manual(values=c("#D55E00", "#0072B2")) #add colours
g1
Again, you see that the mean and both quarterlies are higher for TRUE values than fro FALSE values. However, the C.I.s are align with each other (lines) indicating that even though there is a difference in absences, it would not be statistically significant. However, excluding some of the outliers, might change the results. It seems like that there is one participant (datapoint) whose drinking habits are low (FALSE), but they have over 40 absences. Similar can be found in TRUE-values (high drinking) as well. In other words, it is unlikely that abseces from school is a risk factor when considering alcohol consumption among students.
High alcohol consumption (high_use) and going out with friends (goout: numeric 1=very low to 5=very high): cross-tabulation and barplot
Creating a cross-tabulation
# Cross-tabulation
t1 <- xtabs(~high_use+goout, data=alc_a3)
ftable(t1) # print table.
## goout 1 2 3 4 5
## high_use
## FALSE 19 82 97 40 21
## TRUE 3 15 23 38 32
summary(t1) # chi-square test of indepedence.
## Call: xtabs(formula = ~high_use + goout, data = alc_a3)
## Number of cases in table: 370
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 55.57, df = 4, p-value = 2.463e-11
# or http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t1_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "goout", type = "f")
t1_f #frequency count
## goout 1 2 3 4 5 Sum
## high_use
## FALSE 19 82 97 40 21 259
## TRUE 3 15 23 38 32 111
## Sum 22 97 120 78 53 370
t1_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "goout", type = "c")
t1_c # %column
## goout 1 2 3 4 5
## high_use
## FALSE 86.36 84.54 80.83 51.28 39.62
## TRUE 13.64 15.46 19.17 48.72 60.38
## Sum 100.00 100.00 100.00 100.00 100.00
Based in the summary(0) you can also see that there is a significant difference p<.001, x^2(4)=55.57 between going out with friends and alcohol consumption. However, the chi-square test does not tell where this significant difference is located. However, based on the col% we can guess that the difference in alcohol consumption is higher among students who see their friends more often (scoring 4 or 5: F = 51% and 40%; T = 49% and 60%) in comparison students who dont see their friends that often (scoring 1-3: F ~ 80%; T ~ 20%).
Creating a dataframe
# data frame based on frequencies
df_p2 = data.frame(group=c("1 False", "2 False", "3 False", "4 False", "5 False",
"1 True", "2 True", "3 True", "4 True", "5 True"),
percentage=c(86, 85, 81, 51, 40,
14, 15, 19, 49, 60))
df_p2
## group percentage
## 1 1 False 86
## 2 2 False 85
## 3 3 False 81
## 4 4 False 51
## 5 5 False 40
## 6 1 True 14
## 7 2 True 15
## 8 3 True 19
## 9 4 True 49
## 10 5 True 60
Creating a plot - going out with friens (goout) and alcohol consumption (high_use)
#based on column%
ggplot(df_p2, aes(x=group, y=percentage, fill=group)) + #basic plot
geom_bar(stat="identity") + #define a plot and put groups are side-by-side
geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
scale_fill_manual(values=c("#E69F00", "#D55E00", "#E69F00", "#D55E00", "#E69F00", "#D55E00","#56B4E9", "#0072B2", "#56B4E9", "#0072B2")) #add colours
Based on the plot you can also easily to see that based on the column% studeny who don’t see their friends that often will often have less prevelance of high alcohol consumption, whereas the prevelance for high alcohol consumption seem to be ~50%-50%, or even higher 60%-40% among students who see their friends often. These result support our previous hypothesis: going out with friends (higher incidence) is associated with higher risk of high alcohol consumption.
High alcohol consumption (high_use) and wanting to go take higher education after graduation (higher: YES, NO): cross-tabulation and barplot
Creating a cross-tabulation
# Cross-tabulation
t2 <- xtabs(~high_use+higher, data=alc_a3)
ftable(t2) # print table.
## higher no yes
## high_use
## FALSE 7 252
## TRUE 9 102
summary(t2) # chi-square test of indepedence.
## Call: xtabs(formula = ~high_use + higher, data = alc_a3)
## Number of cases in table: 370
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 5.487, df = 1, p-value = 0.01916
## Chi-squared approximation may be incorrect
# or http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t2_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "higher", type = "f")
t2_f #frequency count
## higher no yes Sum
## high_use
## FALSE 7 252 259
## TRUE 9 102 111
## Sum 16 354 370
t2_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "higher", type = "c")
t2_c # %column
## higher no yes
## high_use
## FALSE 43.75 71.19
## TRUE 56.25 28.81
## Sum 100.00 100.00
Based in the summary(0) you can see that there is not a significant difference p<.02, x^2(1)=5.487 between alcohol consumption and wanthing to go higher education. There might be a small trend though. Also, the variable is very skewed, since only few students don’t want to continue their studies (n=16).
Creating a dataframe
# data frame based on frequencies
df_3 = data.frame(group=c("No False", "No True", "Yes False", "Yes True"),
value=c(7, 9, 252, 111))
df_3
## group value
## 1 No False 7
## 2 No True 9
## 3 Yes False 252
## 4 Yes True 111
# data frame based on col%
df_p3 = data.frame(group=c("No False", "No True", "Yes False", "Yes True"),
percentage=c(44, 56, 71, 29))
df_p3
## group percentage
## 1 No False 44
## 2 No True 56
## 3 Yes False 71
## 4 Yes True 29
NOTE.
Creating a plot - wanthing to continue higher education (higher) and alcohol consumption (high_use)
#based on column%
ggplot(df_p3, aes(x=group, y=percentage, fill=group)) + #basic plot
geom_bar(stat="identity") + #define a plot and put groups are side-by-side
geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours
Based on the plot you can see that the alcohol consumption is quite even (F = 44%, TRUE = 56%) between students who don’t want to continue higher education, yet a bit over half have high alcohol consumption. Majority (71%) of students who want to continue their studies reported low alcohol consumption, and 29% high alcohol use. However, since the data is skewed we might also want to print the frequency plot.
#based on frequencies
ggplot(df_3, aes(x=group, y=value, fill=group)) + #basic plot
geom_bar(stat="identity") + #define a plot and put groups are side-by-side
geom_text(aes(label=value), vjust=-0.3, size=3.5) + #add frequencies
scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours
In this example it might be smarter to present the frequence plot, since otherwise people might miss-interpret the plot.
These result support sort of our previous hypothesis: wanting to take higher education is associated with lower alcohol consumption. However, this interpretation is not very sientific. Better interpretation is to establish that there is not enough data indicating this would be the case, since only 16 participants did not want to continue their studies.
Hint: If your model includes factor variables see for example the RHDS book or the first answer of this stackexchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).
This model predicts high alcohol consumption (more than 2). Because, the variable is not continuous, but binary (FALSE, TRUE) we need to use general linear model or mixed-model, where we specify the model as “binomial”.
The models variables are:
m_0 <- glm(high_use ~ sex + absences + goout + higher, data = alc_a3, family = "binomial") # glm()
summary(m_0) # you can also get the whole summary of the model using summary()-command
##
## Call:
## glm(formula = high_use ~ sex + absences + goout + higher, family = "binomial",
## data = alc_a3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7787 -0.8120 -0.5286 0.7990 2.4772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.67317 0.77959 -4.712 2.46e-06 ***
## sexM 0.99134 0.26216 3.781 0.000156 ***
## absences 0.08279 0.02289 3.617 0.000298 ***
## goout 0.72110 0.12093 5.963 2.48e-09 ***
## higheryes -0.48263 0.59294 -0.814 0.415674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 373.43 on 365 degrees of freedom
## AIC: 383.43
##
## Number of Fisher Scoring iterations: 4
OR_0 <- coef(m_0) %>% exp # compute odds ratios (OR)
CI_0 <- confint(m_0) # compute confidence intervals (CI)
## Waiting for profiling to be done...
cbind(OR_0, CI_0) # print out the odds ratios with their confidence intervals
## OR_0 2.5 % 97.5 %
## (Intercept) 0.02539593 -5.24084803 -2.1674521
## sexM 2.69484233 0.48359815 1.5136772
## absences 1.08631589 0.03959962 0.1303793
## goout 2.05669840 0.48977814 0.9649756
## higheryes 0.61716027 -1.68255125 0.6719255
Every other pedictor, except higher, is statistically significant p<.001.
sex:
Absences:
Going out with friends (goout):
Wanting to go higher education (higher):
NOTE. The difference between Null and Residual deviances tells us the model fit. Greater difference, better fit. However, this is arbitrary. Odds ratios in general:
The results, except higher-variable, are in line with our hypothesis.
2x2 cross tabulation of predictions versus the actual values
m1 <- glm(high_use ~ sex + absences + goout, data = alc_a3, family = "binomial") #drop the higher variable from the model
alc_a3 <- mutate(alc_a3, probability = predict(m1, type = "response"))
alc_a3 <- mutate(alc_a3, prediction = probability > 0.5)
select(alc_a3, high_use, sex, absences, goout, probability, prediction) %>% tail(10) #sanity check.
## high_use sex absences goout probability prediction
## 361 FALSE M 3 3 0.32716276 FALSE
## 362 FALSE M 0 2 0.15403200 FALSE
## 363 TRUE M 7 3 0.40566343 FALSE
## 364 FALSE F 1 3 0.12866204 FALSE
## 365 FALSE F 6 3 0.18408147 FALSE
## 366 FALSE F 2 2 0.07202493 FALSE
## 367 FALSE F 2 4 0.24971608 FALSE
## 368 FALSE F 3 1 0.03919794 FALSE
## 369 TRUE M 4 5 0.69415157 TRUE
## 370 TRUE M 2 1 0.09434569 FALSE
The probability indicates how well our model fits in single datapoints. If the model predicts the datapoint well (over 0.5 probability) it gets value TRUE (prediction).
Graphic visualizing of actual values and the predictions.
ggplot(alc_a3, aes(x = high_use, y = probability)) +
geom_point(size=2, aes(colour=prediction))
The plot illustrates the probability that single data points are succecfully (in blue) predicted in our model (probability is over 0.50). Based on the plot, you can see that we are actually missing fair bit of data points (in red) which our model fails to explain.
table(high_use = alc_a3$high_use, prediction = alc_a3$prediction) %>% # tabulate the target variable versus the predictions
prop.table() %>% # explore probabilities.
addmargins() #add margins.
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65405405 0.04594595 0.70000000
## TRUE 0.16486486 0.13513514 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
Based on the first cross-tabultaion table, we have 67 datapoints that the model was able to successfully predict (prediction = TRUE; 17+50=67) and 303 that it was not able to predict (prediction = FALSE; 242+61=303) with our model.
We can calculate the success and error rates as following:
Specifically, it seems that our model had difficulties predict the “low use alcohol”-category, only 7%.
Based on the second cross-tabluation table prop.table() %>%, we get the same table, but portions. Similarly ~ 19% was successfully predicted based on our model (5% + 14%) and ~ 81% unsuccessful (65% + 16%).
Based on the third cross-tabulation table addmargins() calculates the row, column and over all %, supporting the previous explenations: 81% error rate and 18% success rate.
Total proportion of inaccurately classified individuals training error First, run this code.
loss_func <- function(class, prob) { # define a loss function (mean prediction error)
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# calculates the average wrong predictions of high_use
loss_func(class = alc_a3$high_use, prob = 0) #0.3
## [1] 0.3
loss_func(class = alc_a3$high_use, prob = 0.2) #0.3
## [1] 0.3
loss_func(class = alc_a3$high_use, prob = 0.5) #0
## [1] 0
loss_func(class = alc_a3$high_use, prob = 0.7) #0.7
## [1] 0.7
loss_func(class = alc_a3$high_use, prob = 1) #0.7
## [1] 0.7
loss_func(class = alc_a3$high_use, prob = alc_a3$probability) #0.2108108
## [1] 0.2108108
Unfortunately, I don’t know what those differences comes from. I assume that the total proportion of inaccurate classifies individuals (=training error) is the code, where the prob=0. If this is the case I would assume that our model would miss-classify 30% of the data points. But yet again, this is just a hunch not actually knowledge. Or maybe its the opposite, where prob=1, and our model would miss-classifdy 70%, which is more closer to the cross-tabulation and plot (~82% vs ~18%)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5 # loss function (average prediction error) more than 50%
mean(n_wrong)
}
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
# This is the Exercise3 model.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv$delta[1] # 0.2405405
## [1] 0.2405405
# My model
loss_func(class = alc_a3$high_use, prob = 0) #0.3; average of wrong predictions
## [1] 0.3
#My model with K-folding
cv_a3 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m1, K = 10) # K-fold cross-validation, with 10-fold cross-validation
# average number of wrong predictions in the cross validation
cv_a3$delta[1] #0.2108108
## [1] 0.2135135
My model has a bit better (0.21<0.24) test set than the Exercise3 example
Variables:
Individual predictors (model_a):
Family predictors (model_b):
Relationship predictors (model_c):
m_e1 <- glm(high_use ~ sex + age + Pstatus + Medu + Fedu + famsup + romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e1) # you can also get the whole summary of the model using summary()-command
##
## Call:
## glm(formula = high_use ~ sex + age + Pstatus + Medu + Fedu +
## famsup + romantic + famrel + goout, family = "binomial",
## data = alc_a3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5034 -0.7839 -0.5248 0.8328 2.7038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.08144 1.99332 -2.048 0.04060 *
## sexM 1.00902 0.26233 3.846 0.00012 ***
## age 0.13919 0.11148 1.249 0.21181
## PstatusT -0.16469 0.41922 -0.393 0.69444
## Medu -0.13851 0.15388 -0.900 0.36806
## Fedu 0.06658 0.15400 0.432 0.66551
## famsupyes 0.02742 0.27038 0.101 0.91923
## romanticyes -0.26041 0.27990 -0.930 0.35218
## famrel -0.45258 0.14096 -3.211 0.00132 **
## goout 0.78842 0.12423 6.347 2.2e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 375.84 on 360 degrees of freedom
## AIC: 395.84
##
## Number of Fisher Scoring iterations: 4
m_e2 <- glm(high_use ~ Pstatus + Medu + Fedu + famsup + romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e2) # you can also get the whole summary of the model using summary()-command
##
## Call:
## glm(formula = high_use ~ Pstatus + Medu + Fedu + famsup + romantic +
## famrel + goout, family = "binomial", data = alc_a3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7118 -0.7869 -0.5605 0.9787 2.4585
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.50647 0.80400 -1.874 0.06097 .
## PstatusT -0.11758 0.40944 -0.287 0.77398
## Medu -0.10248 0.14972 -0.684 0.49368
## Fedu 0.05997 0.14909 0.402 0.68750
## famsupyes -0.17237 0.25814 -0.668 0.50430
## romanticyes -0.25006 0.26880 -0.930 0.35222
## famrel -0.39998 0.13538 -2.954 0.00313 **
## goout 0.80180 0.12025 6.668 2.59e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 392.67 on 362 degrees of freedom
## AIC: 408.67
##
## Number of Fisher Scoring iterations: 4
m_e3 <- glm(high_use ~ romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e3) # you can also get the whole summary of the model using summary()-command
##
## Call:
## glm(formula = high_use ~ romantic + famrel + goout, family = "binomial",
## data = alc_a3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5967 -0.7778 -0.5440 0.9544 2.4264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8394 0.6253 -2.941 0.00327 **
## romanticyes -0.2582 0.2684 -0.962 0.33607
## famrel -0.3969 0.1350 -2.939 0.00329 **
## goout 0.7954 0.1195 6.655 2.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 393.71 on 366 degrees of freedom
## AIC: 401.71
##
## Number of Fisher Scoring iterations: 4
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5 # loss function (average prediction error) more than 50%
mean(n_wrong)
}
set.seed(2367)
cv_e1_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e1, K = nrow(alc_a3))
cv_e1_test$delta[1]
## [1] 0.2324324
cv_e1 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e1, K = 10)
cv_e1$delta[1]
## [1] 0.2351351
cv_e2_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e2, K = nrow(alc_a3))
cv_e2_test$delta[1]
## [1] 0.2594595
cv_e2 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e2, K = 10)
cv_e2$delta[1]
## [1] 0.2567568
cv_e3_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e3, nrow(alc_a3))
cv_e3_test$delta[1]
## [1] 0.2513514
cv_e3 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e3, K = 10)
cv_e3$delta[1]
## [1] 0.2513514
Comment added later: set.seed(2367) is a random number that is used each time to generate the cv values. This means we should get the same values everytime now, without it the cv-values differ each time, since it re-generates them each time. We need this since cv.glm() has a random generator build in the function.
e1d_test=cv_e1_test$delta[1]
e1d=cv_e1$delta[1]
e2d_test=cv_e2_test$delta[1]
e2d=cv_e2$delta[1]
e3d_test=cv_e3_test$delta[1]
e3d=cv_e3$delta[1]
deltas_test=c(e1d_test, e2d_test, e3d_test)
deltas_test
## [1] 0.2324324 0.2594595 0.2513514
deltas_train=c(e1d, e2d, e3d)
deltas_train
## [1] 0.2351351 0.2567568 0.2513514
deltas=c(e1d_test, e2d_test, e3d_test,e1d, e2d, e3d)
deltas
## [1] 0.2324324 0.2594595 0.2513514 0.2351351 0.2567568 0.2513514
group_e1=c("testing", "testing","testing", "training", "training", "training")
label_names=c("all test", "family test", "relationship test",
"all train", "family train", "relationship train")
superbonus1 = data.frame(x=c(1:3),
deltas,
group_e1,
label_names)
library(ggrepel)
ggplot(superbonus1, aes(x = x, y = deltas, color = group_e1, group = group_e1)) +
geom_point() + geom_line() +
geom_label_repel(aes(label = label_names),
box.padding = 0.35,
point.padding = 0.5,
segment.color = 'grey50')
Unsure if this is the correct one. I found this link that I found useful.
NOTE. I am not very confident that I got this right, nor I even understood the task correct. I need to go through others assignments and see if I can find solution there.
| End of assignment 3. |
| *** |
r date() |
## [1] "Mon Dec 12 11:02:24 2022" |
| # Assignment 4: Tasks and Instructions |
| ## 2. Analysis (max 15 points) |
| First we install/use R packages we need to complete the assignment. |
r #install.packages('ggplot2') library(ggplot2) library(vtable) library(MASS) |
## ## Attaching package: 'MASS' |
## The following object is masked from 'package:dplyr': ## ## select |
r library(tidyr) library(dplyr) library(corrplot) |
## corrplot 0.92 loaded |
r library(GGally) library(plotly) |
## Warning: package 'plotly' was built under R version 4.2.2 |
## ## Attaching package: 'plotly' |
## The following object is masked from 'package:MASS': ## ## select |
## The following object is masked from 'package:ggplot2': ## ## last_plot |
## The following object is masked from 'package:stats': ## ## filter |
## The following object is masked from 'package:graphics': ## ## layout |
| ### 1. Create a new R Markdown |
| Create a new R Markdown file chapter4.Rmd and include it as a child file in your ‘index.Rmd’ file. |
| ### 2. Boston data (0-1 points) |
| Load Boston data from MASS package. Explore structure, dimensions and describe the dataset briefly. (0-1 points) |
r library(MASS) data("Boston") #download the Boston data dim(Boston) #dimensions |
## [1] 506 14 |
r str(Boston) #structure |
## 'data.frame': 506 obs. of 14 variables: ## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ... ## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ... ## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ... ## $ chas : int 0 0 0 0 0 0 0 0 0 0 ... ## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ... ## $ rm : num 6.58 6.42 7.18 7 7.15 ... ## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ... ## $ dis : num 4.09 4.97 4.97 6.06 6.06 ... ## $ rad : int 1 2 2 3 3 3 5 5 5 5 ... ## $ tax : num 296 242 242 222 222 222 311 311 311 311 ... ## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ... ## $ black : num 397 397 393 395 397 ... ## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ... ## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ... |
| Interpretation of the results |
| Boston dataset has 506 rows (participants/data points)and 14 columns (variables) |
| - crim: per capita crime rate by town. - zn: proportion of residential land zoned for lots over 25,000 sq.ft. - indus: proportion of non-retail business acres per town - chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). - nox: nitrogen oxides concentration (parts per 10 million). - rm: average number of rooms per dwelling. - age: proportion of owner-occupied units built prior to 1940. - dis: weighted mean of distances to five Boston employment centres. - rad: index of accessibility to radial highways. - tax: full-value property-tax rate per $10,000. - ptratio: pupil-teacher ratio by town. - black: 1000(Bk−0.63) 2 where BkBk is the proportion of blacks by town. - lstat: lower status of the population (percent). - medv: median value of owner-occupied homes in $1000s. |
| All the variables are numeric, but some might be ordinal scaling rather than interval or absolute scaling. More detailed description of the dataset can be found here |
| ### 3. Graphical overview and variable summaries.(0-2 points) |
| - Show a graphical overview - Show summaries of the variables in the data. - Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. |
r gg = ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size=2)), title="Graphical overview of Boston data") gg1 = gg + theme(axis.text = element_text(size=5), strip.text.x = element_text(size = 5), strip.text.y = element_text(size=5)) #change the size of the text gg1 |
| Interpretation of the figure |
| There is quite many graphs in display, so it is a bit hard to read the results: |
| - Diagonally: is presented the distributions of each variable. - Below diagonal line: is presented the scatter plots between variables. - Above diagonal line: is presented the correlations between the variables. |
| Only few variables, such as rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes in $1000s) seem to be normally distributed. All the other variables, except chas (Charles River dummy variable) are more or less continuous variables. There seem to be several linear relationship between variables, but also many non-linear relationships can be detected. For example, based on the distributions and scatter plots crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and black: (1000(Bk−0.63)) seem to be more exponential rather than normal distribution. Lot of the correlations between the variables are weak, yet significant. However, if variables are not normally distributed calculating Pearsons correlations is not the most informative way to explore the dataset. |
| Another way to illustrate the relationships between the variables is to use corrplot()-command that is part of corrplot-package. |
r cor_matrix <- cor(Boston) corrplot(cor_matrix, method="circle") |
| Interpretation of the plot |
| The strength of the correlation is illustrated by adjusting the size and transparency of the correlations: |
| - red indicates negative correlation - blue indicates positive correlation - size indicates if the correlation is significant |
| As you can see in the picture, most relatinships are either unsignificant or atleast weak. This is due to the non-linear properties of most of the variables in Boston dataset. |
r st(Boston) # the command prints a summary statistics table to Viewer-window |
| Interpretation of the results |
| The summary table displays the summary statistics (n, mean, std, min-max, 25% and 75% percentiles). There are no missing values in the data set. The summary statistics can also indicate the distribution/linearity of the variables. |
| For example, we know that crim has exponential distribution. The data supports this: |
| - min-max = 0.006-88.976, - yet mean is only 3.614, and - the 75% of the data (Pctl. 75) has a value less than 3.677 |
| Whereas, for rm is normally distributed: |
| - min-max = 3.561-8.78, - pctl.25 = 5.886, - mean = 6.285, - pctl.75 = 6.624 |
| 25% and 75% are close to min and max and mean is located somewhere in the middle. One way to get rid of the issue on non-linearity is to standardize our data, so it will become “normally distributed”. |
| ### 4. Standardize Boston dataset (0-2 points) |
| - Standardize the dataset - Print out summaries of the scaled data. - How did the variables change? |
r boston_scaled_KS <- as.data.frame(scale(Boston)) st(boston_scaled_KS) |
| When we standardize variables, we set the same scale for all variables, which allows you to compare scores between different types of variables. When standardizing we set the mean = 0 and std = 1, as can be seen on the summary statistics. Negative values are smaller than mean and positive higher than the mean. |
| - Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). - Use the quantiles as the break points in the categorical variable. - Drop the old crime rate variable from the dataset. - Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. |
| ```r # Create categorical variable boston_scaled_KS\(crim <- as.numeric(boston_scaled_KS\)crim) #quantiles are Pctl.25 = -0.411 and Pctl.75 = 0.007 |
| bins <- quantile(boston_scaled_KS$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%) bins # save it as a vector, so you can use thos as a cut-offs ``` |
## 0% 25% 50% 75% 100% ## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 |
| ```r # 0% 25% 50% 75% 100% # -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 |
| crime <- cut(boston_scaled_KS$crim, breaks = bins, labels=c(“low”, “med_low”, “med_high”, “high”), include.lowest = TRUE) #categorical variable of crime summary(crime) #you can see that variable now has 4 even categories (1=0-25%, 2=25%-50%, 3=50%-75%, 4=75%-100%) ``` |
## low med_low med_high high ## 127 126 126 127 |
r # Drop crim and add crime boston_scaled_KS <- dplyr::select(boston_scaled_KS, -crim) #discard the old crim variable using -crim in the scaled Boston data glimpse(boston_scaled_KS) #sanity-check that crim has been deleted |
## Rows: 506 ## Columns: 13 ## $ zn <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019… ## $ indus <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1… ## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0… ## $ nox <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0… ## $ rm <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689… ## $ age <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434… ## $ dis <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766… ## $ rad <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0… ## $ tax <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1… ## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11… ## $ black <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016… ## $ lstat <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866… ## $ medv <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0… |
r boston_scaled_KS <- data.frame(boston_scaled_KS, crime) #add the new crime categorical variable glimpse(boston_scaled_KS) #sanity-check that crime exist in the dataset |
## Rows: 506 ## Columns: 14 ## $ zn <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019… ## $ indus <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1… ## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0… ## $ nox <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0… ## $ rm <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689… ## $ age <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434… ## $ dis <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766… ## $ rad <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0… ## $ tax <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1… ## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11… ## $ black <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016… ## $ lstat <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866… ## $ medv <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0… ## $ crime <fct> low, low, low, low, low, low, med_low, med_low, med_low, med_l… |
| NOTE. |
| dollar sign is replaced with & due RMarkdown syntax |
| EDITED CODE: in the Pre-exercise-code of Exercise 4.5, straight AFTER the line that starts with boston_scaled <- read.table and ends with sep=“,”, header = T), you should ADD ONE LINE: |
| boston_scaled&crime <- factor(boston_scaled&crime, levels = c(“low”, “med_low”, “med_high”, “high”)) |
| ```r # Divide dataset into test (20%) and train (80%) boston_scaled_KS <- read.table(“https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt”, sep=“,”, header = T) |
| boston_scaled_KS\(crime <- factor(boston_scaled_KS\)crime, levels = c(“low”, “med_low”, “med_high”, “high”)) |
| ind <- sample(nrow(boston_scaled_KS), size = nrow(boston_scaled_KS) * 0.8) #creating a indicator that has 80% of the variables train_4 <- boston_scaled_KS[ind,] #80% of the dataset goes to train dataset test_4 <- boston_scaled_KS[-ind,] #the remaining 20% (-ind) goes to test dataset correct_classes_4 <- test_4$crime #save the correct classes from test data test_4 <- dplyr::select(test_4, -crime) # remove the crime variable from test data ``` |
r glimpse(test_4) #sanity-check: no crime, rows = 102, columns 13 |
## Rows: 102 ## Columns: 13 ## $ zn <dbl> 0.04872402, 0.04872402, -0.48724019, -0.48724019, -0.48724019,… ## $ indus <dbl> -0.4761823, -0.4761823, -0.4368257, -0.4368257, -0.4368257, -0… ## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0… ## $ nox <dbl> -0.2648919, -0.2648919, -0.1440749, -0.1440749, -0.1440749, -0… ## $ rm <dbl> 0.13145938, -0.56308673, -0.64136549, -0.49761722, -0.79365326… ## $ age <dbl> 0.91389483, -1.05066066, -0.42896588, -1.39525719, 0.03286452,… ## $ dis <dbl> 1.2117799501, 0.7863652700, 0.3341187865, 0.3341187865, 0.0006… ## $ rad <dbl> -0.5224844, -0.5224844, -0.6373311, -0.6373311, -0.6373311, -0… ## $ tax <dbl> -0.57694801, -0.57694801, -0.60068166, -0.60068166, -0.6006816… ## $ ptratio <dbl> -1.50374851, -1.50374851, 1.17530274, 1.17530274, 1.17530274, … ## $ black <dbl> 0.3926395, 0.3705134, 0.4265954, 0.3305330, 0.3754425, 0.21793… ## $ lstat <dbl> 1.09184562, 0.42807876, -0.58577611, -0.85044265, -0.19227719,… ## $ medv <dbl> -0.81904111, -0.09055093, -0.28626471, 0.06167090, -0.47110550… |
r glimpse(train_4) #sanity-check: crime, rows = 404, columns 14 |
## Rows: 404 ## Columns: 14 ## $ zn <dbl> -0.4872402, 0.3703025, -0.4872402, -0.4872402, -0.4872402, -0.… ## $ indus <dbl> 1.0149946, -1.0446662, 1.0149946, 1.0149946, 1.0149946, -1.033… ## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0… ## $ nox <dbl> 0.3650828, 0.7965722, 1.0727255, 1.4092873, 1.5991427, -0.3857… ## $ rm <dbl> -0.75095575, 1.75823437, -0.46630571, 0.17984989, -0.04787015,… ## $ age <dbl> 1.11638970, 0.73982029, 1.11638970, 1.11638970, 0.99560328, 0.… ## $ dis <dbl> -1.13123011, -0.78607339, -0.94620939, -0.91947256, -0.7584342… ## $ rad <dbl> 1.6596029, -0.5224844, 1.6596029, 1.6596029, 1.6596029, -0.522… ## $ tax <dbl> 1.52941294, -0.85581834, 1.52941294, 1.52941294, 1.52941294, -… ## $ ptratio <dbl> 0.8057784, -2.5199404, 0.8057784, 0.8057784, 0.8057784, -0.857… ## $ black <dbl> -3.87835651, 0.34718238, -3.72665028, -0.41540159, 0.34882541,… ## $ lstat <dbl> -0.35611838, -0.75521871, 1.11565161, 0.33145447, 0.53030446, … ## $ medv <dbl> -0.81904111, 2.23626845, -1.49316635, -0.63420032, -1.07999281… ## $ crime <fct> high, med_high, high, high, high, med_low, med_low, high, high… |
| ### 5. Linear discriminant analysis (LDA) (0-3 points) |
| - Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables (.) in the dataset as predictor variables. - Draw the LDA (bi)plot. |
| About the method: In general, LDA method is often use in statistics to find separate groups or clusters (minimum 2) based on different characteristics of the data. LDA has continuous independent variable(s) and a categorical dependent variable. Where factor analysis is creating factors based on the similariries in data, LDA creates them based on differences and defines (in)dependent variables, wheres in factor analysis (especially in exploratory factor analysis) the groups are often data-driven. Also, LDA is used when the groups have already defined (in comparison to cluster analysis). See more details Linear discriminant analysis (LDA). |
| Here, we already have defined the groups: low, mid_low, mid_high, high. And we want to see how these groups behave based on the other variables in the data. |
| ### R code |
r lda.fit_train <- lda(crime ~ ., data = train_4) # linear discriminant analysis (LDA), "." indicates all other variables lda.fit_train |
## Call: ## lda(crime ~ ., data = train_4) ## ## Prior probabilities of groups: ## low med_low med_high high ## 0.2549505 0.2450495 0.2500000 0.2500000 ## ## Group means: ## zn indus chas nox rm age ## low 0.8698420 -0.8674413 -0.15765625 -0.8706697 0.37073123 -0.9136636 ## med_low -0.1045942 -0.2204308 -0.03371693 -0.5399207 -0.10761791 -0.2907756 ## med_high -0.3836558 0.1615472 0.23442641 0.3666207 0.09941511 0.3907919 ## high -0.4872402 1.0171306 -0.07742312 1.0737508 -0.37675368 0.8170265 ## dis rad tax ptratio black lstat ## low 0.8586611 -0.6863918 -0.7410673 -0.39472713 0.37788733 -0.76068416 ## med_low 0.3045647 -0.5410455 -0.4377826 -0.04944639 0.30721753 -0.13228837 ## med_high -0.3623956 -0.4485731 -0.3478948 -0.29227586 0.03665772 0.01934235 ## high -0.8433213 1.6379981 1.5139626 0.78062517 -0.77138661 0.81995549 ## medv ## low 0.476322315 ## med_low -0.004335825 ## med_high 0.166094643 ## high -0.674032054 ## ## Coefficients of linear discriminants: ## LD1 LD2 LD3 ## zn 0.09745607 0.67278694 -0.90865277 ## indus 0.11873656 -0.11668076 0.38758009 ## chas -0.12012650 -0.11460714 0.05007341 ## nox 0.30494813 -0.73228469 -1.45370850 ## rm -0.09877828 -0.10264974 -0.08760066 ## age 0.18874223 -0.33121843 0.02197068 ## dis -0.07358444 -0.18930761 0.17603274 ## rad 3.79898225 0.89984531 -0.19270754 ## tax -0.08038869 0.01707639 0.72685751 ## ptratio 0.09900427 0.02759309 -0.30887462 ## black -0.07801827 0.07043170 0.17154693 ## lstat 0.23234829 -0.33063842 0.28791254 ## medv 0.17322855 -0.39962988 -0.26305669 ## ## Proportion of trace: ## LD1 LD2 LD3 ## 0.9607 0.0303 0.0090 |
| Interpretation of the results |
| Interpretention of the results are adapted from Linear Discriminant Analysis in R - finnstats and Discriminant Analysis Essentials in R. |
| - Prior probabilities of groups: All the groups are even size, approx. 25% (based on the quarterlies) - Group means: groups means for each variable based on each category. For example, - In zn (proportion of residential land zoned for lots over 25,000 sq.ft.) all the other groups, expect low crime are getting negative values. Since the data is standardized, the positive values indicate that the values are greater than mean. Meaning that in areas with low crime rate the proportions of residential land zones are bigger in comparison to areas with higher crime rates. - Inmedv (median value of owner-occupied homes in $1000s.), only areas with “high” crime rate had negative values. This means that the areas with high crime rate have on average less owner-occupied homes (median value in 1000s dollars) than other areas. - Coefficients of linear discriminants: first discriminant function (LD1) has all 4 groups (low, mid_low, mid_high, high), second discriminant (LD2) has 3, and LD3 only 2. The coefficient indicate the linear combination of predictors that are used to form the LDA decision rule. For example - LD1: .12zn + .11indus - .06chas - .16nox -.13rm + .25age -.07dis + 3.67rad + .07tax + .15ptratio - .15black + .24lstat + .21*medv - To create the groups (4 clusters) rad (index of accessibility to radial highways) seems to have the most impact (3.67), followed by age (0.26), lstat (lower status of the population (percent), 0.24) and medv (median value of owner-occupied homes in $1000s, 0.21). - Proportion of trace: Percentage separations achieved by the first discriminant function (LD1 = 4 groups) is 95%, LD2 (3 groups) is 4% and LD3 (2groups) 1% |
| Using the function plot() produces biplots of the linear discriminant, obtained by computing LD1 and LD3 for each of the training observations. |
| ```r lda.arrows_train <- function(x, myscale = 1, arrow_heads = 0.1, color = “red”, tex = 0.75, choices = c(1,2)){ heads <- coef(x) arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads) text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col=color, pos=3) } |
| classes_4 <- as.numeric(train_4$crime) # target classes as numeric 4 classes: low, mid_low, mid_high, high |
| # plot the lda results plot(lda.fit_train, dimen = 2, col = classes_4, pch = classes_4) lda.arrows_train(lda.fit_train, myscale = 1) ``` |
| Similar to the coefficients of linear discriminants values in previous table, rad (index of accessibility to radial highways) seem to have most impact of differianting the clusters. Based on the plot, the coefficient, and group mean rad values seem to be different between high crime rate and other categories. This means that the rad has most impact when differentiating the clusters. Moreover, having access in radial highways seem to be most associated with high criminal rate. |
| ### 6. Predict LDA model (0-3 points) |
| - Save the crime categories from the test set and remove the variable - Predict the classes with the LDA model on the test data - Cross tabulate the results with the crime categories from the test set - Comment on the results |
| The first two steps have already been done. The test data set should does not contain “crime” variable. |
r # The first two steps have already been done. The test dataset does not contain "crime" variable. glimpse(test_4) #sanity check, no crime |
## Rows: 102 ## Columns: 13 ## $ zn <dbl> 0.04872402, 0.04872402, -0.48724019, -0.48724019, -0.48724019,… ## $ indus <dbl> -0.4761823, -0.4761823, -0.4368257, -0.4368257, -0.4368257, -0… ## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0… ## $ nox <dbl> -0.2648919, -0.2648919, -0.1440749, -0.1440749, -0.1440749, -0… ## $ rm <dbl> 0.13145938, -0.56308673, -0.64136549, -0.49761722, -0.79365326… ## $ age <dbl> 0.91389483, -1.05066066, -0.42896588, -1.39525719, 0.03286452,… ## $ dis <dbl> 1.2117799501, 0.7863652700, 0.3341187865, 0.3341187865, 0.0006… ## $ rad <dbl> -0.5224844, -0.5224844, -0.6373311, -0.6373311, -0.6373311, -0… ## $ tax <dbl> -0.57694801, -0.57694801, -0.60068166, -0.60068166, -0.6006816… ## $ ptratio <dbl> -1.50374851, -1.50374851, 1.17530274, 1.17530274, 1.17530274, … ## $ black <dbl> 0.3926395, 0.3705134, 0.4265954, 0.3305330, 0.3754425, 0.21793… ## $ lstat <dbl> 1.09184562, 0.42807876, -0.58577611, -0.85044265, -0.19227719,… ## $ medv <dbl> -0.81904111, -0.09055093, -0.28626471, 0.06167090, -0.47110550… |
| Implement and predict training model (train_4) for testing data (test_4). |
r lda.pred_test <- predict(lda.fit_train, newdata = test_4) # predict classes with test data table(correct = correct_classes_4, predicted = lda.pred_test$class) # cross tabulate the results |
## predicted ## correct low med_low med_high high ## low 21 3 0 0 ## med_low 6 18 3 0 ## med_high 1 4 17 3 ## high 0 0 0 26 |
| Interpretation of the results |
| Overall, the using test data we manage to predict correct 68 values (67%). |
| - The model seem to predict well especially the high values (high-high = 30 out of 31) - med_high values were predicted correct (med_high - med_high) 13 (48%). 41% was predicted to be med_low (med_high - med_low) - med_low values were predicted correct (med_low - med_low) 15 (60%) and 32% as high (med_low - med_high). - low values were predicted correcr (low - low) 10 (48%) and 43% as med_low (med_low - low). |
| ### 7. K-means (0-4 points) |
| - Reload the Boston dataset and standardize the dataset. - Calculate the distances between the observations. - Run k-means algorithm on the dataset. - Investigate what is the optimal number of clusters and run the algorithm again. - Visualize the clusters using scatterplot - Interpret the results |
| Reload the Boston dataset |
r data("Boston") boston_scl <- as.data.frame(scale(Boston)) st(boston_scl) |
r # Create categorical variable boston_scl$crim <- as.numeric(boston_scl$crim) bins_scl <- quantile(boston_scl$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%) bins_scl # save it as a vector, so you can use thos as a cut-offs |
## 0% 25% 50% 75% 100% ## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 |
| ```r # 0% 25% 50% 75% 100% # -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 |
| crime_scl <- cut(boston_scl$crim, breaks = bins_scl, labels=c(“low”, “med_low”, “med_high”, “high”), include.lowest = TRUE) summary(crime_scl) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127) ``` |
## low med_low med_high high ## 127 126 126 127 |
r # Drop crim and add crime boston_scl <- dplyr::select(boston_scl, -crim) #discard the old crim variable using -crim in the scaled Boston data boston_scl <- data.frame(boston_scl, crime_scl) #add the new crime categorical variable st(boston_scl) #sanity-check |
r boston_scl$crime_scl <- factor(boston_scl$crime_scl, levels = c("low", "med_low", "med_high", "high")) |
| Count distances: Euclidean distance matrix is default, to use manhattan distance matrix, specify method=“manhattan” |
r # Distances between the observations # ?dist dist_eu_scl <- dist(boston_scl) |
## Warning in dist(boston_scl): NAs introduced by coercion |
r summary(dist_eu_scl) |
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1394 3.5267 4.9081 4.9394 6.2421 13.0045 |
r dist_man_scl <- dist(boston_scl, method="manhattan") |
## Warning in dist(boston_scl, method = "manhattan"): NAs introduced by coercion |
r summary(dist_man_scl) |
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.2849 8.8386 13.0300 13.8657 18.0953 46.8948 |
| NOTE. I got an error code Warning: NAs introduced by coercion . When crime_scl was a factor. When numeric no error message occured. |
| Treat crime_scl as numeric rather than factor. |
| ```r # Distances between the observations boston_scl\(crime_scl <- as.numeric(boston_scl\)crime_scl) |
| dist_eu_scl <- dist(boston_scl) summary(dist_eu_scl) ``` |
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1343 3.5760 4.9401 4.9913 6.3033 12.8856 |
r dist_man_scl <- dist(boston_scl, method="manhattan") summary(dist_man_scl) |
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.2645 8.9648 13.2765 14.1297 18.5263 46.5452 |
| Interpretation of the results |
| The top one is the euclidean distance and below manhattan distance. The distance is used to measure the dissimilarity between vectors. dist() command calculates the distance between several vectors (distance between the rows) in a matrix (boston_scl). Larger values indicate that there are greater difference between the rows and therefore, there might be some distinct patterns between different groups. |
| In general, I have never counted distances, so I am not very familiar with this approach. I found this page quite useful to explain it. If I understood correct, the distance is also arbitrary number, meaning it does not have specific scale to use (like in correlation: -1 to 1). If the distance is zero the points are identical (similar), whereas high values indicate little similarity. |
| I also found this page helpful to explain the basics. |
| - In I understoof correctly, euclidean distance is calculated used linear line, whereas manhattan uses grid-like path similar to chessboard and hence gets overall higher numbers. - Based on the summaries, it seems that we have values that are very similar to each other (min = 0.13 and min_manhattan = 0.26), and values that are very different (max = 12.89 and max_manhattan = 46.55). - 1st Qu, Median and Mean are relatively similar values (and sort of 3rd Qu.) that might suggest that most values are relatively close together. - But since max is so much higher (almost double), we might have either outliers or actual groups with larger distances. |
| K-means algorithm NOTE. I used the standardized (boston_scl) dataset, not the raw dataset, like in the exercise. |
| ```r # Investigate what is the optimal number of clusters and run the algorithm again set.seed(123) #This sets the “spot” where the kmeas()-function will start to generate the k-mean clusters. Setting seed assures that we get the same results each time. Otherwise, it would generate a new k-means each time (different starting point). |
| k_max_scl <- 10 #define max amount of clusters twcss_scl <- sapply(1:k_max_scl, function(k){kmeans(boston_scl, k)$tot.withinss}) # calculate the total within sum of square qplot(x = 1:k_max_scl, y = twcss_scl, geom = ‘line’, xlim = c(1,10)) # visualize the results ``` |
| Interpretation of the plot The optimal number of clusters is when the value of total WCSS changes radically. Based on the plot, the most optimal number of clusters would be 2-4 clusters. |
| Next, we are going to generate and plot these models and then choose, which was is the most optimal. |
| Choosing the best model & number of clusters |
r km_2 <- kmeans(boston_scl, centers = 2) # 2 indicates the number of clusters we are generating km_3 <- kmeans(boston_scl, centers = 3) km_4 <- kmeans(boston_scl, centers = 4) # plot the Boston dataset with clusters. Only view the column between 5-10 (nox, rm, age, dis, rad, tax) pairs(boston_scl, col = km_2$cluster) #2 clusters: red and black |
r pairs(boston_scl, col = km_3$cluster) #3 clusters: red, black and green |
r pairs(boston_scl, col = km_4$cluster) #4 clusters: red, black, green and blue |
| Interpretation of the plots |
| We try to find the most optimal number of clusters (minimal overlapping) and detect some variables that seem to have the most distinct clusters. |
| - Two clusters (km_2): There is minimal of overlapping between the 2 clusters. Most distinct differences can be detected for rad, black, lstat, and medv - Three clusters (km_3): You can still group different colours into their own distinc clusters, not so much overlapping. indus, nox, rm, age, rad and tax seem to have most distinct groups. - Four clusters (km_4): when having 4 clusters, it seems that the last cluster (black) is overlapping with other clusters. |
| Next, we can have a look at the same plots, but focuding only rm, age, dis, rad and ptratio variables |
r pairs(boston_scl[5:10], col = km_2$cluster) #2 clusters: red and black |
r pairs(boston_scl[5:10], col = km_3$cluster) #3 clusters: red, black and green |
r pairs(boston_scl[5:10], col = km_4$cluster) #4 clusters: red, black, green and blue |
| Interpretation of the plot |
| - Two clusters (km_2): tax and ptratio seem to have less distinct patterns with other variables. - Three clusters (km_3): tax and ptratio seem to have less distinct patterns with rad - Four clusters (km_4): the 4th clusters (black) is overlapping with other clusters. |
| Therefore, maybe model with 2-3 clusters are the most optimal to describe the data. |
| ### Bonus: to compensate any loss of points from the above exercises (0-2 points) |
| - Perform k-means on the original Boston data with some reasonable number of clusters (> 2). - Remember to standardize the dataset - Then perform LDA using the clusters as target classes. - Include all the variables in the Boston data in the LDA model. - Visualize the results with a biplot including the arrows representing the relationships between original variables and LDA. - Interpret the results. - Which variables are the most influential linear separators for the clusters? |
| Read and standardize |
r data("Boston") boston_bonus <- as.data.frame(scale(Boston)) st(boston_bonus) #sanity check |
r # Create categorical variable boston_bonus$crim <- as.numeric(boston_bonus$crim) bins_b <- quantile(boston_bonus$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%) bins_b # save it as a vector, so you can use those as a cut-offs |
## 0% 25% 50% 75% 100% ## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 |
r # 0% 25% 50% 75% 100% # -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610 st(boston_bonus) |
r crime_b <- cut(boston_bonus$crim, breaks = bins_b, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE) summary(crime_b) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127) |
## low med_low med_high high ## 127 126 126 127 |
r boston_bonus <- dplyr::select(boston_bonus, -crim) #discard the old crim variable using -crim in the scaled Boston data boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable boston_bonus$crime_b <- factor(boston_bonus$crime_b, levels = c("low", "med_low", "med_high", "high")) st(boston_bonus) |
r #k-means boston_bonus$crime_b <- as.numeric(boston_bonus$crime_b) set.seed(123) k_max <- 10 twcss_b <- sapply(1:k_max, function(k){kmeans(boston_bonus, k)$tot.withinss}) qplot(x = 1:k_max, y = twcss_b, geom = 'line', xlim = c(1,10)) |
r #based on the plot 2-3 clusters are the most optimal |
| Interpretation of the plot |
| Model with 2-3 clusters seem to desrcibe our data the best. |
r bkm_2 <- kmeans(boston_bonus, centers = 2) pairs(boston_bonus, col = bkm_2$cluster) |
r bkm_3 <- kmeans(boston_bonus, centers = 3) pairs(boston_bonus, col = bkm_3$cluster) |
r #based on the plots model with 2 clusters seem to be more optimal. The third cluster seems to get very similar results than the other clusters. |
| Interpretation of the plot |
| - Two clusters (bkm_2): Not so much overlapping, 2 cluster model seem to present our data pretty well. - Three clusters (bkm_3): Some values have quite a lot of overlapping so you dont even see all 2 colours (e.g., zn, chass, ) |
| Therefore, maybe model with 2 clusters is the most optimal to describe our data. But, since we need to have more than 2 clusters based on the guidelines: Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). We run the following model with 3 clusters. |
| Run LDA |
| For some reason when I tried to knit the Assignment 4, this did not work anymore, but instead gave me an error code: Error in x - group.means[g, ] : non-conformable arrays . Unsure, how to fix it. Below is a screenshot of the model, when it did work. |
| Here is the code in blue |
| NOTE. The dollar sign has replaces with &, due to RMarkdown syntax |
| boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable boston_bonus&crime_b <- factor(boston_bonus&crime_b, levels = c(“low”, “med_low”, “med_high”, “high”)) |
| library(MASS) |
| boston_bonus&crime_b <- as.numeric(boston_bonus&crime_b) |
| lda.fit_b <- lda(bkm_3&cluster ~ ., data = boston_bonus) #LDA using the clusters as target classes |
| lda.fit_b |
| Interpretation of the results |
| - Prior probabilities of groups: group1 = 48%, group2=19%, group3=33% - Group means: groups means for each variable based on each category. For example, - zn (proportion of residential land zoned for lots over 25,000 sq.ft.) all the other groups, expect group2 are negative. The positive values indicate that the values are greater than mean. Meaning that in this group the proportions of residential land zones are bigger in comparison to groups. - indus (proportion of non-retail business acres per town), nox (nitrogen oxides concentration, parts per 10 million), and age, lstat (lower status of the population, percent) all had group1 close to 0, group2 = -1 and group3 = 1. This means that these variables might have very distinct patterns based on our clusters and they should be easy to detect. The plot above support this statement. - dis (weighted mean of distances to five Boston employment centres) has the opposite values for group2 (-1) and group3 (1) - crime_b I am unsure how to interpret the crime_b variable though - Coefficients of linear discriminants: first discriminant function (LD1) has 3 clusters and second discriminant (LD2) has 2. The coefficient indicate the linear combination of predictors that are used to form the LDA decision rule. For example - Proportion of trace: Percentage separations achieved by the first discriminant function (LD1 = 3 clusters) is 84% and LD2 (2 groups) 16%. |
| lda.arrows_b <- function(x, myscale = 1, arrow_heads = 0.1, color = “red”, tex = 0.75, choices = c(1,2)){ |
| heads <- coef(x) arrows(x0 = 0, y0 = 0, |
| x1 = myscale * heads[,choices[1]], |
| y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads) |
| text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col=color, pos=3) |
| } |
| classes_b <- as.numeric(bkm_3$cluster) # target classes as numeric |
| # plot the lda results plot(lda.fit_b, dimen = 2, col = classes_b, pch = classes_b) |
| lda.arrows_b(lda.fit_b, myscale = 1) |
| Interpretation of the plot |
| zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seem to be the most influential linear separators between the clusters, followd by age, crime, tax and nox. |
| ### Super-Bonus: Plotly() 3D plot (0-3 points) (to compensate any loss of points from the above exercises) |
| - Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. - Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below. - Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. - Draw another 3D plot where the color is defined by the clusters of the k-means. - How do the plots differ? Are there any similarities? |
r model_predictors_sb <- dplyr::select(train_4, -crime) #the LD values for the first 3D plot is generated by train_4 data (80%). # check the dimensions dim(model_predictors_sb) |
## [1] 404 13 |
r dim(lda.fit_train$scaling) |
## [1] 13 3 |
r # matrix multiplication matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_train$scaling matrix_product_sb <- as.data.frame(matrix_product_sb) |
| Install and access plotly and draw 3D plot |
r #install.packages("plotly") #plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers') plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = train_4$crime) |
{=html} <div id="htmlwidget-44474b3e8cd160f85a98" style="width:576px;height:384px;" class="plotly html-widget"></div> <script type="application/json" data-for="htmlwidget-44474b3e8cd160f85a98">{"x":{"visdat":{"d84404812ae":["function () ","plotlyVisDat"]},"cur_data":"d84404812ae","attrs":{"d84404812ae":{"x":[6.92479443469034,-1.76929243990097,7.31220653341853,7.05729956670459,7.01239679473167,-1.95300878786114,-3.04841437436667,6.9632540163712,7.16803451344172,-1.16210438353346,6.29757380890646,7.00912205629887,-1.08839978854359,-1.9310694082109,7.42384607950557,6.88641269240062,7.15030779324755,-1.89969230472143,-1.37956264920308,-0.989970941197395,7.51878633331119,6.81018613456774,-0.800836457735869,-3.50564733640533,-3.15334760238319,-3.07343024078921,-2.28506433987314,-2.47024713908725,7.12556463416392,-3.35560154743073,7.89581295924321,-2.17056341670846,-3.20358425806132,-3.09977195483959,7.09103733325887,-1.85800114642165,7.05007369530946,-0.648129036619982,6.9049696779264,-2.10302531579229,-2.6118753266799,7.43977898909113,7.175626507034,-3.25795106792961,-2.2610502375321,-1.6125577751924,6.87504163695965,-1.64198867009458,6.86404477928609,-2.50395081966598,7.11968867650073,7.07651710978727,-1.63241099348691,-2.49168073685727,-3.03881013996222,-2.71601224464813,6.55243542454266,7.30107952593156,-1.81673256933956,7.64003521293673,-1.78599869600627,-2.4578748014512,-3.85512486175803,-3.72233042095896,-2.42718226749483,-0.941322736574876,-3.212232255526,-3.52868231175967,-2.70031643942833,-1.44640071598716,7.14265312838844,-2.36485110075973,-0.640760921223522,-3.10028849355645,-2.17347087081451,-1.88147326149436,6.96586686865469,-3.17424990011337,7.01000175547194,-0.81933243035809,-2.88406649295359,-1.52969913840919,-2.77530055858516,7.15444139095287,6.78866188540258,-1.94874727092084,7.16977965964232,7.06121701192771,-2.31695406008082,-2.43656131050479,-3.49425384488999,-1.30300725095014,-2.66165237770623,-3.0016395930851,-3.79597577011349,6.77679803480308,-2.68057546523544,-2.42786817522526,-4.28729939548547,-3.41345372797971,6.55193127198326,-1.59097002826539,-3.20255337606677,-3.21769624353851,-1.39805908019807,-1.9327842601142,-2.38273403846818,-3.49382869839315,-1.4916082060715,-3.62793863217678,-2.40968301068152,6.6124833184634,7.1842720942293,-2.89218861409356,-4.65468403047868,-1.34933645901058,-3.22748668306813,-3.79591381229572,6.93774801369599,6.53011012863931,-3.41632597476241,-1.77418749068505,7.90017918607625,-1.656645839113,-3.55191078826805,5.88580671576082,-2.97242969511474,-3.14639492754784,-3.18040835126642,7.05877333053686,-1.36600095798046,-3.67564856196909,6.94079989465125,6.955163803191,-4.0757981427668,7.27198116102046,-2.59226420200108,-1.47660291334234,-3.56946063720633,-2.66095079556534,-2.42441513787009,-3.06471170072102,-2.16247073532,-1.7991527379547,7.27329989740923,-2.21732083868514,-3.48087174227998,-2.34816787498006,-3.26805237499614,6.94175543028646,-2.07878816422805,7.18643275596512,-4.0419678820709,-1.59548511419486,-2.59113595140755,7.29059328236976,-0.389523142087116,-2.47888478754301,-3.26333486269634,-2.04912424827551,7.16701008181298,-2.96843542443139,-3.65776631571092,-1.81543314607997,-0.681903041947926,-3.36359218187229,6.85843396724164,-0.904599648536334,-1.64839893202334,-0.498613886835177,-1.71497704324788,-3.28171463127653,-1.42968490974296,-1.62044761875161,-2.72590347586785,-2.04436416131562,6.63486756716485,-3.85805946757369,-2.48252012726116,-1.50263274147134,-2.55898416233392,-2.67686927525315,-3.34993671972359,-1.09281184147703,6.76772377981268,6.13185460774129,-2.52534772859105,7.17464181033054,-0.367293508330506,7.14771147433998,-3.18715033281531,-4.58865491193033,-2.47041877409088,-2.16491349071616,-1.73502819449116,-4.3726814636257,-0.977241338607186,6.66147119201692,-1.66856877885729,6.31209886765644,7.24538775220127,-2.09499020015907,-2.97751185252054,-1.97687917381721,-1.38952183648591,-2.3087527281106,7.21827155680063,7.30732041449054,7.05616035008232,7.27400867746413,-1.70018626957074,-2.67086861704607,-1.78281844986233,-2.48096759624409,-2.34638003456882,-2.96874522287639,-3.02360940839129,-3.72921291385981,-2.53199571264683,-2.08400915404304,-2.78885502191171,6.9675382502043,6.60189190013015,-2.23490986656451,-1.68789992212259,-2.47592766998599,-2.90044582513607,7.50390933639098,-2.33390625424615,-3.50982745433753,-3.07155219338769,-1.62719074620207,-2.22016215655021,-3.49170183271469,-1.45337744114334,-3.37770703901017,-2.35159406585056,6.61158259553044,-3.89329673941975,-2.26369590757651,-2.10172375901644,-1.3847536974209,-2.60261654334018,-1.10377577401655,-3.82797869180787,-2.7405572410435,-3.43775224599056,6.69153012898656,-2.17675132866371,-4.44424834042072,-2.4614896316362,-2.23326227853333,-1.84007261905849,-3.51921281474623,7.12965304166561,-2.09033145947521,-4.23154624704717,7.42437688063055,-0.7089096807714,-1.99547244683633,-2.81529998207741,-2.39841240569013,-2.79444490456042,-3.82164087170313,-2.56134486395365,-2.25747692053,-3.62888229503155,-3.53686031576448,-4.65628588669786,6.90851053430686,-1.80330924664602,6.66099408073198,-2.45967379293124,-1.87785904082196,-2.45987547993517,-1.89508685124305,-2.35970284837227,-2.98759661868615,6.85837780670015,-3.6281285782464,-3.94609855264269,-1.5075966009686,7.00469315312179,-1.69982540836452,-1.89688561467739,-3.50067058484791,-3.55251624439046,-1.41948259592804,-2.4320780178518,7.63532224060839,7.28794135781978,-3.01906843440461,-1.24571308956746,-1.63044963584432,-1.6005521880762,-2.21051114816131,-3.41835963015704,-1.89113442156731,-1.32631804467857,-2.4633576506278,-2.54716417706923,-3.35142730338706,-2.46498219373519,-2.20737706806954,-3.53355567680532,-1.40262562072168,6.61571507070266,-1.63731093154936,-1.33888601261769,-1.96760643230707,-3.40230915582097,-3.42820207034945,-1.31328811247307,-1.5944574348063,7.35853971855177,-2.89293807071585,7.12356380343593,7.0385583152927,6.09830100726565,-4.60333442050344,-1.72240911590473,-2.1305579749373,7.04899598769931,7.19913061867957,7.02680774481243,-2.99070233077692,6.94088967720957,-2.19403846909222,-0.889842271282933,7.19853446903608,-2.61474632978618,6.45700950945927,-1.69234485104341,7.68242458301653,-3.03977941225215,7.31686577362293,-1.8153691038198,6.69670830412122,-2.96399238057031,6.8912199891701,-2.08895671860367,-4.40122062595089,-1.75782957905972,-0.963442673200675,7.03663762966239,-1.56373472013486,-3.55064856092184,6.7770070571793,-0.884999511498559,-2.383924969079,-2.22880635654271,7.47894598030886,-1.54475395877035,-1.59654299210172,-1.54167188267037,-2.63675394816547,-3.19987440697255,-1.28453107704603,7.21947179985851,-2.00712470209233,-4.77156771469356,-1.78308228072063,-1.35612078244121,-2.84093043181402,-2.48475859964628,7.1404932355848,-0.964390074725863,-2.54894250057949,-2.17324264052225,-1.77550323672971,6.93347095276068,7.23374105633344,-1.06825909021296,-3.41705069379898,-1.68881671771116,-3.71231980955485,-1.59431161831985,-3.2629897049185,-2.98747198559618,-2.6843197111964,-3.89291801477597,-2.94682488491055,-2.45504840445251,6.9607812804498,-4.7289105084523,-0.987696273192545,-4.56550018576793,-4.16175735082408,7.17407424953949,7.41049727017632,-1.41933601803832,7.30537938203583,-2.52205090078371,-2.24498675135583,-3.06893750097301,-2.7966113972952,-1.84405517472037,-3.11144871076551,-2.71828637242305,-2.24383518289473,-0.984060356837622,-2.57577168741856,-0.883670337434012,-4.59796624201324],"y":[0.95273135535027,-1.63161947873313,0.163752109889677,-0.00486206523405996,0.0552364150840169,-0.509270010292348,-1.01747002363211,0.137799542912149,-0.0664751998866932,-0.316230635964758,1.39155013955229,-0.0644591108392509,-0.892008258240348,-1.31960636056619,-0.535849038956007,-1.01404272957593,-0.16107325541107,1.57123842226219,-3.42622400256615,0.984711076125353,0.116826068540651,0.996156106149138,-2.9138681618822,-0.229127755229563,2.97877898573708,0.150411970282781,-0.110716939922152,-1.72881529948432,-0.0699699056052454,0.0727298909403923,-0.168219486338555,-0.696638055116947,0.72424921171105,-0.170395849593443,0.489371061637893,-1.06139957139415,0.157651741456608,-3.05502694243407,0.540333977734427,-1.60932508029201,0.405512807024751,0.0778166137344992,0.299059785407223,2.56451356931623,-1.70261486055837,-0.264812524082603,0.744903143551984,-0.42158291122717,1.16817225979116,0.556384379774058,0.594135741409087,0.199187799889084,-0.657741793826337,-0.469693295695982,0.607711239727536,-1.34970243071556,0.519382182302198,0.500212074232967,-1.20540373502586,0.143727203736186,-1.60582182428198,0.204587979823946,-0.74176080884859,-0.311838972214075,0.214056959387655,-1.07072350055339,2.34309266668541,2.52948258553575,1.50860887244057,-0.523117955502722,-0.233845486465488,0.0774411436839476,-2.88214605168551,2.9229257929761,-0.890317158035016,-1.94442850707882,-0.232484030707015,-0.278220093936303,0.389418594650186,0.712820143566962,1.13803817391935,-1.58348021579426,3.01721771413738,-0.216159700352516,0.695100313365572,-0.281016877051704,0.0118272684009419,-0.220228093704684,0.0130678101426698,-0.594150176833401,-0.03540997553985,-3.38585906125071,-1.46365684682358,2.16570067167416,-0.522835584234477,-0.870450574294164,0.318281149285555,-0.182250029495606,-0.163517343185963,0.190208028085958,-0.66318117674463,0.666055177897204,-1.21664647853885,1.1032955049857,-0.596758492985543,-1.13947102601972,-0.892641860264895,-0.610227979190995,0.813734947532937,-1.14509539678995,2.09911987372579,0.670015197202527,0.121830029645184,0.880595811567945,0.962783988416197,-0.811958421987623,2.79750487375806,-0.517252910741722,0.416349743835017,0.932281181163356,0.98804498844771,-0.549396183654916,-0.124060890138104,-0.340650581009671,-0.0420397739710998,-0.352912509060641,1.38616346697506,2.8948719831721,1.25351809893775,0.271224925600922,-0.67685183075534,0.0879991986133098,0.518704500241918,0.592698987439292,2.63964829756068,-0.14176202221528,0.077745970293495,-0.638641250561509,0.836506016635764,0.107149624155406,-1.49000262475019,2.97226268482977,-1.9180271490669,-1.17740821829981,-0.0289823591601381,0.0401132425114254,0.0421879948146553,-0.0950869536297786,0.403070177813424,0.345432021134951,-0.64716536923915,-0.302635325930159,-0.562269220561461,-1.3420259621234,-1.56036792805585,-0.340831325623901,-3.55483693114026,-0.528642174006532,-0.583358400319553,-0.975663945116622,0.23791961727965,0.77269495321991,-0.259046620790837,-1.50762197251385,-3.06400469824943,0.850375190189461,0.790040832262816,-0.304409957886928,-0.350936014788153,-3.29897992716342,-1.31375725425002,1.03684956122087,1.39600472433571,-1.41647630412127,0.813596382686032,1.50949067451351,0.975247611980585,-0.485649699269625,-0.100545381962068,-0.346970797628606,0.148396049179851,1.54696403427131,1.13260205264322,-0.368200007651646,0.567168224046432,1.37421267816224,-0.620625287525359,-0.0641606177322905,-3.40977312601429,-0.320632155874013,0.618769728072113,0.369780686167431,0.0299265475184135,-0.132961122180317,-1.55059353388897,2.16762568126438,-1.08984250430436,0.44117899903265,-0.60690502297341,1.40601847977881,-0.0558648905947088,-0.155108416515081,2.36337590724161,-1.04103889689091,0.518059110679652,-0.695956545890832,0.0521376318985789,0.078116683144321,0.18629231025261,0.136642009890951,-2.22417577072571,-0.56978519840542,-1.47862863234562,-1.65264419023098,-0.8884701947567,2.49032523875552,-0.787184333678784,-0.567196352749248,0.151001039406995,1.06044750006966,1.34172191885211,0.2588174436987,1.0998886549639,-0.458400224073027,-0.583386808998892,0.214444908207888,0.242067877021668,-0.515945210047121,-0.779769540700784,-0.0347285120027766,-1.21380951286178,-1.48228452967532,-0.394551402180593,0.572217082140371,-0.544042447345757,0.399599158381443,-0.628183606916963,0.903702835061962,-0.134484410832157,1.03242728272474,-0.768700684751002,-0.629645458780947,0.0661797471286705,-2.96165867766026,0.538057489997317,-0.120743262366634,3.22485622034421,-0.530848456369357,-2.81155845483269,1.2001966392275,-1.05255960174706,-1.96219573714475,-0.44712103410027,2.37738723496202,0.299675878128054,1.27289094783724,1.94238717011904,0.604134779264851,-3.11259160754009,-0.978260739854904,-0.266851782742031,1.35387865414653,0.445517650101181,2.99378355501725,1.67702282264461,-0.279217091305749,-1.21278775094862,0.412360910980309,2.65998220619301,-0.245246418752923,-0.59204123329488,1.41205638014682,0.118759521738216,-1.19375282529686,0.0216178333315382,-0.444594508120712,-0.500284747038051,-0.645894703827276,0.457488802621755,2.46759199002379,0.483788900445484,-0.435301033689639,-0.0651575801508174,-1.41801815909067,-0.549590905473963,2.25187894304127,0.991251104919139,1.1060397999793,2.25097197891882,-0.531347358778224,-0.520746947604085,0.683339221165493,1.35171860738546,-1.37721603013961,-1.61962244811287,-0.832024683457384,0.431973079138501,-0.437906979046975,-1.62958135646003,0.0510848331158283,-0.577690917899283,0.566155182246301,-0.905786797366523,1.04921456606211,0.91691569799784,-0.757213444627259,-0.814790441427979,-1.97652969899719,0.583888564220603,1.6818609861465,0.828806308280536,1.04175765571787,0.931521490646662,-1.46678069334414,0.174054437451666,1.3160079324747,0.124604027614805,0.0681151679145779,1.27599094125239,1.30126138254228,-1.33115393552319,1.5089603947676,0.574734409280235,0.0287278476774065,0.470136848697146,-0.959566860473827,0.113033677025294,0.478689742401934,-1.32387701016881,-0.120579975717441,-0.31761972547059,1.31655169026713,-1.48776194332388,-0.135602513032693,1.26341204018124,-0.0809752327792659,-1.48813071198135,0.745879780392878,-0.678250048776581,0.134423865962061,-1.36300157798071,2.38883673615715,0.864837256929574,0.0639080970925194,0.1731139099063,0.915148184038715,0.169290730167617,0.170763639458406,-3.08582373973908,0.133247857509048,-0.959934385629038,-0.22128458660508,-1.52082875458356,-1.34990742443717,-1.20255119397862,0.614831253687175,2.64282223604258,-0.265898085694732,-0.191667964003606,-0.618164259357733,2.01679274471469,-1.41924115894895,-0.784810027149217,2.49643494262723,-1.35782126849696,0.194320643701129,-2.04910239217287,-0.655932924671176,-0.870932738200651,-1.42693373521706,0.471464032265817,0.385039609189856,-0.681466708629906,0.197709014272482,-1.29428723223246,-0.503830963631595,-1.30886139380752,0.0347185653102332,1.36556894228294,1.79030779134942,-0.864153152632234,1.54825064708573,0.316645962656802,0.215383999354583,2.04484281878219,-3.65552096947324,2.44060686826433,2.58528360984114,0.667466438070569,-0.235854690289397,-0.545636023000749,-0.01762555240459,0.137295685413293,-0.784408403560482,1.14377535205525,0.754401605184899,1.09611115010977,1.70582377616065,-1.29444186457983,-2.92129637512291,-0.376325857821407,-0.23081167408616,-0.0286684601134749,1.07393427846073],"z":[0.173507958930287,-2.67830125388609,-0.220465258333285,-0.645327519019081,-0.57005831289299,0.616459896554878,-0.780741438285653,-0.491530500808348,0.405861833351719,-0.118318332391146,1.14456076431397,-0.352113335850177,0.395581541999237,1.9614299409078,-0.896018508310778,-0.968447580260685,-0.735815746102854,0.440081721027855,-2.19044315007205,0.267032309685373,0.625844373213993,1.47485354908811,-2.41714527870799,0.200591413016203,-1.53826730966923,2.1869371353426,-0.0553184856077445,0.272016082888696,-0.285543253636539,0.974078545887799,1.36006442163347,0.248761759238037,1.83492736498382,1.88124598226793,0.50283358078778,-2.2322862486621,0.361109024686378,-2.52487852839761,0.814362984920942,0.967377397144998,1.99609842941219,0.739393246493896,0.226620305388718,-1.87859873368566,0.492691841349714,0.912820904053374,0.547068772332324,0.7463435826331,0.77256793551244,0.303492504960537,1.40641478688249,0.505041755971474,0.71010392171972,0.475840912745016,0.665700635419208,0.11387958225602,0.214426665835588,-0.879843097367963,0.50178051376327,0.586734013041864,-0.108742722440492,-0.300500487651191,0.170230269015476,0.31652785013154,0.191940045053133,1.11564320007672,-0.45740239013973,-2.31288190650569,-0.287147916547337,0.818010116778938,-0.601220515464608,-0.149867467615467,-2.56738439668857,-2.38607093658286,0.373382897645395,-3.13732963953692,-1.67433383552435,1.57710233882142,0.5275833249839,0.32342968372622,-0.0288044219843346,0.199980435264632,-2.54765039429646,-0.725937853133073,1.00697264125383,-1.75070065357483,-0.679840520363655,-1.34886692459286,2.12656773566896,-0.044570681200736,0.527450638446342,-2.60592022014368,0.12639169226497,-3.39928856034585,-0.609449274097644,-0.872357272193804,0.306525724505042,0.188329811411534,-0.69597604223603,1.34941173263846,-1.17612327353257,0.818260047824363,-1.21727272616684,-0.114573363659388,0.104737929362167,0.176254734386535,-0.143427442897162,-0.0967688322157606,1.05749836429055,-0.842202892103678,-0.16015517422689,0.17537776251999,-0.262538573168348,1.80394127634439,-0.432301705524405,0.376364522787893,-1.61239967546788,0.130377243517435,0.285224035576357,1.27922644908295,1.63339175345795,-0.232258115918122,1.27042596728434,0.710710711647655,1.87127095407737,-1.22681733420988,-0.82233733903721,-0.449844270866508,0.0865813449042116,-0.0569455450359458,1.00343619286468,1.79849771973382,1.3481891356414,1.52704910395582,-1.09109518626277,-1.14282389238233,-0.279723703318135,0.261402174070273,-0.85424973509364,-0.0721220214656424,1.67303517582267,-2.28838594223054,0.76807921814998,0.711020594340852,-0.073845339368248,0.348852563975344,0.790192739103407,0.637221061932761,2.017267569797,0.455555284395819,0.910379794724714,-0.609365983592446,-0.145354285597069,2.71784417258831,0.119016654491994,-0.959596841177821,-2.08774948642519,0.192047008938649,-0.740455225861264,0.493125463750977,0.63665196141287,0.120388770609145,-0.105287451590222,0.0496479919457366,-1.98191153785947,1.64816376726816,0.775354730702775,0.0764897630022898,0.764322846738228,-1.69205891152528,-1.96935751757094,0.530059937994777,-0.156437526309414,0.380518798755228,0.0978778304784699,-0.141064747991833,1.36215636965808,-0.036695102225893,-0.228388650863833,-0.121711782565479,0.185928188304308,-1.57984008463282,1.41609125240119,-0.134043122534093,0.975708563373498,1.39526695302502,0.0475575976816548,-1.34482889496996,-1.71320446214249,-1.31322642883527,0.0779912592183085,-0.360225559824714,0.414110260765194,0.702427263641774,0.177659789892266,-1.11592659909136,-0.804424826063794,-0.0833931617270097,0.611937480843653,1.05801234965139,-0.73815639612998,0.388238221322894,-0.473047849805015,0.199759341071406,-0.141378703552411,0.159875089234845,-0.885251787580986,0.321666883406585,-0.495956319357246,-0.675384518683543,-0.589034582542512,-0.157305035467972,0.165786892145642,0.314701480340146,0.0243771859330854,-2.66443508959459,-0.497691199364828,-0.392068674893234,0.379228303785058,0.739568100736119,-0.403325437693944,0.0930642007896773,0.524581451555445,1.4374640559592,0.564589765599985,0.435478736418127,0.785881386714411,-0.93721553363262,0.545751291311456,0.282239787848517,-0.801531259477206,0.553596590326227,0.741598138586343,1.50077506657354,-0.0956502307652168,1.729303269912,-1.90944343154616,0.188829047315349,0.517800163625527,0.52934637014206,0.388581897980018,0.909198996491827,-0.366244799338639,-2.84109484967655,0.819426902753836,-0.0950599520200208,-1.08788805002573,-0.983235371956161,-0.38311943851345,0.700724401390981,0.727364321838376,0.654880334792198,0.436197058725477,0.0897586304627224,-0.131653049055238,0.414914626896442,-1.05463443185915,0.13147451081421,-2.49644845900557,0.609073744121918,2.55419245082415,0.331067360290696,0.185651676731684,-1.99100767375327,0.267780465981547,0.522090539848162,-0.96380426393817,1.22393746498114,-0.542138510925835,-0.51072563065901,0.557617887088536,0.74773465046291,2.20113962017105,-2.27876144829735,0.256838794672639,0.504178013274367,-0.0467892643493847,-0.446174164479694,-0.0801122051816521,-2.58268308530692,0.669149845668088,-0.171437554636399,-0.619148983371202,0.669818091552176,0.386148670537737,-2.58069168356235,-0.350226597552452,-1.39804681453274,-0.111689292396388,-0.776189083446346,-0.671978100391584,0.859095980692473,0.0236570998043008,0.512213633960729,0.329342054403274,0.323933378518671,-0.833400878725531,0.406935102557054,2.91386107132016,0.271350294736458,0.326921158623243,1.77947116470242,0.0389592224841209,0.606989635913978,-0.906171994143057,0.394746954860111,-1.12870161303502,-0.18015575404526,1.2969049568535,0.252629948150208,1.58341556709588,0.293101390377901,-0.141371767341595,0.400382037165432,0.866112547200893,-0.335672756755833,-0.791621925387844,-0.175172879518281,1.30226501834707,0.345279162085749,2.33705201857648,0.433919282224716,0.363153425769134,-0.252916318742727,0.303180426837829,1.08678576850982,-0.442298451861169,-0.284631499584069,-0.895352328137867,-0.862633994716404,-0.13527140686637,1.25891099977589,0.264509694118803,1.1384045642256,-1.10243344389477,-0.781454035630539,0.0562635747451358,0.590307003985251,-0.0655934899482685,-0.435038777836703,-2.06314136283494,-1.11132926739008,0.607318503153174,0.428357654599608,0.103425324076133,-1.73438082656397,1.13733340595552,-0.674145553320738,-2.46922149197885,-0.188169674457099,0.677416644500756,0.128683858047499,0.60110965316707,0.534132625344637,1.51102828882541,2.10293059107999,-0.733414059708051,-0.719816536191311,-0.998057717865307,0.0339689312696631,-0.855968238971975,2.31661345095434,1.04134240646336,-2.30003790859397,1.6918862738574,-0.264800720310152,1.19326485106875,0.103163926002359,0.206090284927721,-2.45231321182487,0.0379549522248451,1.37299997090052,-0.476057615307119,0.741978867550185,0.872551713846899,0.867737142210359,0.943319973823917,1.76918116412115,-0.40890518981598,0.285087431850877,0.0321258620935078,-0.109993937693549,0.280002950690133,0.348606194383015,-0.82491083333247,-1.41529952934896,-1.73261930543094,-1.58450369797042,1.01963326569246,-0.47611923332658,0.107488149644325,-0.0601307631117023,-0.234906379990687,0.236949580528614,0.189162694757061,-0.276837508578492,0.676174061826858,-2.76303724606631,1.30011096481746,-0.39351019465626,-0.0987765517316302,-0.684213155511586,0.511344946703657,-0.245890598027562],"mode":"markers","color":["high","med_high","high","high","high","med_low","med_low","high","high","med_high","med_high","high","med_high","med_low","high","high","high","low","med_high","med_low","high","high","med_high","low","low","low","low","med_low","high","low","high","med_high","med_low","low","high","med_high","high","med_high","high","med_low","med_low","high","high","low","med_high","med_low","high","med_low","high","low","high","high","med_low","med_low","low","low","high","high","med_high","high","med_high","low","low","low","med_low","med_high","low","low","low","med_low","high","low","med_high","low","med_high","med_high","high","med_low","high","med_low","low","med_high","low","high","high","med_high","high","high","med_low","med_high","low","med_high","low","low","med_low","high","med_high","low","low","med_low","high","med_low","low","low","med_low","low","med_high","low","med_low","med_low","low","high","high","low","low","med_high","low","low","high","high","med_low","med_low","high","med_low","low","med_high","low","low","low","high","med_low","low","high","high","low","high","low","med_high","low","low","med_high","low","med_high","med_low","high","med_low","low","med_low","med_low","high","med_low","high","low","med_low","med_low","high","med_high","med_high","low","med_high","high","low","low","med_high","med_high","med_low","high","med_high","med_low","high","med_high","low","med_low","med_high","low","med_low","high","low","med_high","med_low","med_high","low","low","med_high","high","high","med_high","high","med_high","high","med_high","low","low","med_high","med_high","low","med_high","high","med_low","high","high","med_low","low","med_high","med_high","med_high","high","high","high","high","med_high","med_high","med_high","med_low","med_high","low","med_low","low","med_high","med_low","med_low","high","high","med_high","med_low","med_high","med_low","high","med_low","med_low","low","med_high","med_low","low","med_low","low","med_high","med_high","med_low","med_low","med_high","med_low","low","med_high","med_low","med_high","low","high","med_high","low","low","med_low","med_low","low","high","med_low","low","high","med_high","med_high","med_low","med_low","med_low","low","med_low","med_low","low","med_low","low","high","med_high","high","low","med_high","med_low","med_low","low","low","high","low","med_low","med_high","high","med_high","med_low","low","med_low","low","low","high","high","med_low","med_low","med_high","med_high","med_high","med_low","med_low","med_low","med_low","med_high","low","med_high","med_low","low","med_high","high","med_high","med_low","med_low","med_low","low","med_low","med_high","high","med_low","high","high","high","low","med_low","med_low","high","high","high","med_low","high","med_high","med_high","high","med_high","high","med_high","high","med_low","high","med_high","high","low","high","med_high","low","med_low","med_high","high","low","med_low","high","med_high","low","med_high","high","med_low","med_high","med_low","med_low","low","med_high","high","med_low","low","med_low","med_low","low","med_high","high","med_high","med_high","med_high","med_high","high","high","med_high","low","med_high","low","med_high","med_low","med_low","med_low","med_low","low","low","high","low","med_high","low","low","high","high","med_high","high","low","med_high","low","med_low","med_high","low","med_low","med_high","med_high","low","med_high","low"],"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"scatter3d"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"scene":{"xaxis":{"title":[]},"yaxis":{"title":[]},"zaxis":{"title":[]}},"hovermode":"closest","showlegend":true},"source":"A","config":{"modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"data":[{"x":[-1.89969230472143,-3.50564733640533,-3.15334760238319,-3.07343024078921,-2.28506433987314,-3.35560154743073,-3.09977195483959,-3.25795106792961,-2.50395081966598,-3.03881013996222,-2.71601224464813,-2.4578748014512,-3.85512486175803,-3.72233042095896,-3.212232255526,-3.52868231175967,-2.70031643942833,-2.36485110075973,-3.10028849355645,-2.88406649295359,-2.77530055858516,-3.49425384488999,-2.66165237770623,-3.0016395930851,-2.42786817522526,-4.28729939548547,-3.20255337606677,-3.21769624353851,-1.9327842601142,-3.49382869839315,-2.40968301068152,-2.89218861409356,-4.65468403047868,-3.22748668306813,-3.79591381229572,-3.55191078826805,-2.97242969511474,-3.14639492754784,-3.18040835126642,-3.67564856196909,-4.0757981427668,-2.59226420200108,-3.56946063720633,-2.66095079556534,-3.06471170072102,-3.48087174227998,-4.0419678820709,-3.26333486269634,-2.96843542443139,-3.65776631571092,-3.28171463127653,-2.72590347586785,-3.85805946757369,-2.67686927525315,-3.34993671972359,-4.58865491193033,-2.47041877409088,-4.3726814636257,-2.97751185252054,-2.96874522287639,-3.72921291385981,-3.07155219338769,-3.49170183271469,-3.37770703901017,-2.60261654334018,-3.43775224599056,-4.44424834042072,-2.4614896316362,-3.51921281474623,-4.23154624704717,-3.82164087170313,-3.62888229503155,-4.65628588669786,-2.45967379293124,-2.35970284837227,-2.98759661868615,-3.6281285782464,-3.50067058484791,-1.41948259592804,-2.4320780178518,-3.35142730338706,-3.53355567680532,-3.42820207034945,-4.60333442050344,-2.96399238057031,-4.40122062595089,-1.56373472013486,-2.383924969079,-3.19987440697255,-4.77156771469356,-2.84093043181402,-3.41705069379898,-3.71231980955485,-2.94682488491055,-2.45504840445251,-4.7289105084523,-4.56550018576793,-4.16175735082408,-2.52205090078371,-3.06893750097301,-3.11144871076551,-2.57577168741856,-4.59796624201324],"y":[1.57123842226219,-0.229127755229563,2.97877898573708,0.150411970282781,-0.110716939922152,0.0727298909403923,-0.170395849593443,2.56451356931623,0.556384379774058,0.607711239727536,-1.34970243071556,0.204587979823946,-0.74176080884859,-0.311838972214075,2.34309266668541,2.52948258553575,1.50860887244057,0.0774411436839476,2.9229257929761,1.13803817391935,3.01721771413738,-0.03540997553985,-1.46365684682358,2.16570067167416,-0.182250029495606,-0.163517343185963,-1.21664647853885,1.1032955049857,-1.13947102601972,-0.610227979190995,2.09911987372579,0.880595811567945,0.962783988416197,2.79750487375806,-0.517252910741722,-0.0420397739710998,1.38616346697506,2.8948719831721,1.25351809893775,0.0879991986133098,2.63964829756068,0.077745970293495,0.836506016635764,0.107149624155406,2.97226268482977,0.0421879948146553,-0.562269220561461,-0.583358400319553,0.77269495321991,-0.259046620790837,1.03684956122087,0.813596382686032,-0.485649699269625,1.54696403427131,1.13260205264322,0.369780686167431,0.0299265475184135,2.16762568126438,2.36337590724161,2.49032523875552,-0.567196352749248,-1.21380951286178,0.572217082140371,0.399599158381443,0.0661797471286705,3.22485622034421,1.2001966392275,-1.05255960174706,2.37738723496202,1.94238717011904,2.99378355501725,-1.21278775094862,2.65998220619301,0.118759521738216,-0.500284747038051,-0.645894703827276,2.46759199002379,2.25187894304127,1.1060397999793,2.25097197891882,0.566155182246301,0.91691569799784,1.04175765571787,1.30126138254228,-0.678250048776581,2.38883673615715,0.915148184038715,0.133247857509048,2.64282223604258,2.01679274471469,2.49643494262723,0.197709014272482,-0.503830963631595,1.54825064708573,0.316645962656802,2.04484281878219,2.44060686826433,2.58528360984114,0.137295685413293,1.14377535205525,1.70582377616065,-0.23081167408616,1.07393427846073],"z":[0.440081721027855,0.200591413016203,-1.53826730966923,2.1869371353426,-0.0553184856077445,0.974078545887799,1.88124598226793,-1.87859873368566,0.303492504960537,0.665700635419208,0.11387958225602,-0.300500487651191,0.170230269015476,0.31652785013154,-0.45740239013973,-2.31288190650569,-0.287147916547337,-0.149867467615467,-2.38607093658286,-0.0288044219843346,-2.54765039429646,0.527450638446342,0.12639169226497,-3.39928856034585,0.188329811411534,-0.69597604223603,-1.21727272616684,-0.114573363659388,0.176254734386535,-0.0967688322157606,-0.16015517422689,1.80394127634439,-0.432301705524405,-1.61239967546788,0.130377243517435,1.87127095407737,-0.82233733903721,-0.449844270866508,0.0865813449042116,1.79849771973382,-1.09109518626277,-0.279723703318135,-0.85424973509364,-0.0721220214656424,-2.28838594223054,0.790192739103407,-0.145354285597069,-0.740455225861264,0.120388770609145,-0.105287451590222,0.530059937994777,0.0978778304784699,-0.036695102225893,-1.57984008463282,1.41609125240119,-0.360225559824714,0.414110260765194,-1.11592659909136,-0.473047849805015,-2.66443508959459,-0.392068674893234,-0.801531259477206,1.50077506657354,1.729303269912,-0.366244799338639,-1.08788805002573,0.700724401390981,0.727364321838376,0.0897586304627224,-1.05463443185915,-1.99100767375327,-0.96380426393817,-0.542138510925835,2.20113962017105,-0.0467892643493847,-0.446174164479694,-2.58268308530692,-2.58069168356235,-1.39804681453274,-0.111689292396388,1.77947116470242,-0.906171994143057,0.293101390377901,0.345279162085749,-0.0655934899482685,-1.11132926739008,-1.73438082656397,-0.188169674457099,-0.733414059708051,-0.855968238971975,-2.30003790859397,0.741978867550185,0.867737142210359,-0.109993937693549,0.280002950690133,-0.82491083333247,-1.73261930543094,-1.58450369797042,-0.234906379990687,0.189162694757061,-2.76303724606631,-0.684213155511586,-0.245890598027562],"mode":"markers","type":"scatter3d","name":"low","marker":{"color":"rgba(102,194,165,1)","line":{"color":"rgba(102,194,165,1)"}},"textfont":{"color":"rgba(102,194,165,1)"},"error_y":{"color":"rgba(102,194,165,1)"},"error_x":{"color":"rgba(102,194,165,1)"},"line":{"color":"rgba(102,194,165,1)"},"frame":null},{"x":[-1.95300878786114,-3.04841437436667,-1.9310694082109,-0.989970941197395,-2.47024713908725,-3.20358425806132,-2.10302531579229,-2.6118753266799,-1.6125577751924,-1.64198867009458,-1.63241099348691,-2.49168073685727,-2.42718226749483,-1.44640071598716,-3.17424990011337,-0.81933243035809,-2.31695406008082,-3.79597577011349,-3.41345372797971,-1.59097002826539,-1.39805908019807,-1.4916082060715,-3.62793863217678,-3.41632597476241,-1.77418749068505,-1.656645839113,-1.36600095798046,-1.7991527379547,-2.21732083868514,-2.34816787498006,-3.26805237499614,-2.07878816422805,-1.59548511419486,-2.59113595140755,-3.36359218187229,-1.64839893202334,-1.42968490974296,-2.04436416131562,-1.50263274147134,-1.66856877885729,-2.09499020015907,-2.48096759624409,-3.02360940839129,-2.08400915404304,-2.78885502191171,-1.68789992212259,-2.90044582513607,-2.33390625424615,-3.50982745433753,-2.22016215655021,-1.45337744114334,-3.89329673941975,-2.26369590757651,-1.3847536974209,-3.82797869180787,-2.23326227853333,-1.84007261905849,-2.09033145947521,-2.81529998207741,-2.39841240569013,-2.79444490456042,-2.56134486395365,-2.25747692053,-3.53686031576448,-2.45987547993517,-1.89508685124305,-3.94609855264269,-1.89688561467739,-3.55251624439046,-3.01906843440461,-1.24571308956746,-3.41835963015704,-1.89113442156731,-1.32631804467857,-2.4633576506278,-2.20737706806954,-1.33888601261769,-1.96760643230707,-3.40230915582097,-1.31328811247307,-2.89293807071585,-1.72240911590473,-2.1305579749373,-2.99070233077692,-3.03977941225215,-1.75782957905972,-3.55064856092184,-1.54475395877035,-1.54167188267037,-2.63675394816547,-2.00712470209233,-1.78308228072063,-1.35612078244121,-3.2629897049185,-2.98747198559618,-2.6843197111964,-3.89291801477597,-2.7966113972952,-2.71828637242305],"y":[-0.509270010292348,-1.01747002363211,-1.31960636056619,0.984711076125353,-1.72881529948432,0.72424921171105,-1.60932508029201,0.405512807024751,-0.264812524082603,-0.42158291122717,-0.657741793826337,-0.469693295695982,0.214056959387655,-0.523117955502722,-0.278220093936303,0.712820143566962,0.0130678101426698,-0.522835584234477,0.190208028085958,0.666055177897204,-0.596758492985543,0.813734947532937,-1.14509539678995,0.98804498844771,-0.549396183654916,-0.340650581009671,-0.67685183075534,-1.17740821829981,0.0401132425114254,-0.0950869536297786,0.403070177813424,-0.64716536923915,-1.3420259621234,-1.56036792805585,0.850375190189461,-0.350936014788153,1.39600472433571,1.50949067451351,-0.346970797628606,-0.60690502297341,-0.155108416515081,-1.65264419023098,-0.787184333678784,1.06044750006966,1.34172191885211,-0.583386808998892,0.242067877021668,-0.779769540700784,-0.0347285120027766,-0.394551402180593,-0.544042447345757,-0.134484410832157,1.03242728272474,-0.629645458780947,0.538057489997317,-1.96219573714475,-0.44712103410027,1.27289094783724,-0.266851782742031,1.35387865414653,0.445517650101181,1.67702282264461,-0.279217091305749,0.412360910980309,0.0216178333315382,-0.444594508120712,0.483788900445484,-0.549590905473963,0.991251104919139,0.683339221165493,1.35171860738546,0.431973079138501,-0.437906979046975,-1.62958135646003,0.0510848331158283,1.04921456606211,0.583888564220603,1.6818609861465,0.828806308280536,0.931521490646662,1.3160079324747,-1.33115393552319,1.5089603947676,-0.959566860473827,1.26341204018124,0.864837256929574,0.169290730167617,-1.52082875458356,-1.20255119397862,0.614831253687175,-0.618164259357733,-1.41924115894895,-0.784810027149217,0.0347185653102332,1.36556894228294,1.79030779134942,-0.864153152632234,0.754401605184899,-1.29444186457983],"z":[0.616459896554878,-0.780741438285653,1.9614299409078,0.267032309685373,0.272016082888696,1.83492736498382,0.967377397144998,1.99609842941219,0.912820904053374,0.7463435826331,0.71010392171972,0.475840912745016,0.191940045053133,0.818010116778938,1.57710233882142,0.32342968372622,2.12656773566896,-0.609449274097644,1.34941173263846,0.818260047824363,0.104737929362167,1.05749836429055,-0.842202892103678,1.63339175345795,-0.232258115918122,0.710710711647655,1.00343619286468,0.711020594340852,0.348852563975344,0.637221061932761,2.017267569797,0.910379794724714,2.71784417258831,0.119016654491994,1.64816376726816,0.764322846738228,-0.156437526309414,-0.141064747991833,-0.121711782565479,0.611937480843653,0.388238221322894,0.314701480340146,-0.497691199364828,0.739568100736119,-0.403325437693944,0.564589765599985,0.785881386714411,0.545751291311456,0.282239787848517,0.741598138586343,-0.0956502307652168,0.517800163625527,0.52934637014206,0.909198996491827,0.819426902753836,0.654880334792198,0.436197058725477,0.414914626896442,2.55419245082415,0.331067360290696,0.185651676731684,0.267780465981547,0.522090539848162,1.22393746498114,0.256838794672639,0.504178013274367,0.669149845668088,0.386148670537737,-0.350226597552452,0.859095980692473,0.0236570998043008,-0.833400878725531,0.406935102557054,2.91386107132016,0.271350294736458,0.606989635913978,1.2969049568535,0.252629948150208,1.58341556709588,-0.141371767341595,-0.335672756755833,2.33705201857648,0.433919282224716,1.08678576850982,-1.10243344389477,0.607318503153174,1.13733340595552,0.60110965316707,1.51102828882541,2.10293059107999,0.0339689312696631,2.31661345095434,1.04134240646336,1.76918116412115,-0.40890518981598,0.285087431850877,0.0321258620935078,-0.276837508578492,1.30011096481746],"mode":"markers","type":"scatter3d","name":"med_low","marker":{"color":"rgba(252,141,98,1)","line":{"color":"rgba(252,141,98,1)"}},"textfont":{"color":"rgba(252,141,98,1)"},"error_y":{"color":"rgba(252,141,98,1)"},"error_x":{"color":"rgba(252,141,98,1)"},"line":{"color":"rgba(252,141,98,1)"},"frame":null},{"x":[-1.76929243990097,-1.16210438353346,6.29757380890646,-1.08839978854359,-1.37956264920308,-0.800836457735869,-2.17056341670846,-1.85800114642165,-0.648129036619982,-2.2610502375321,-1.81673256933956,-1.78599869600627,-0.941322736574876,-0.640760921223522,-2.17347087081451,-1.88147326149436,-1.52969913840919,-1.94874727092084,-2.43656131050479,-1.30300725095014,-2.68057546523544,-2.38273403846818,-1.34933645901058,5.88580671576082,-1.47660291334234,-2.42441513787009,-2.16247073532,-0.389523142087116,-2.47888478754301,-2.04912424827551,-1.81543314607997,-0.681903041947926,-0.904599648536334,-1.71497704324788,-1.62044761875161,-2.48252012726116,-2.55898416233392,-1.09281184147703,-2.52534772859105,-0.367293508330506,-3.18715033281531,-2.16491349071616,-1.73502819449116,-0.977241338607186,-1.97687917381721,-1.38952183648591,-2.3087527281106,-1.70018626957074,-2.67086861704607,-1.78281844986233,-2.34638003456882,-2.53199571264683,-2.23490986656451,-2.47592766998599,-1.62719074620207,-2.35159406585056,6.61158259553044,-2.10172375901644,-1.10377577401655,-2.7405572410435,-2.17675132866371,-0.7089096807714,-1.99547244683633,-1.80330924664602,-1.87785904082196,-1.5075966009686,-1.69982540836452,-1.63044963584432,-1.6005521880762,-2.21051114816131,-2.54716417706923,-2.46498219373519,-1.40262562072168,-1.63731093154936,-1.5944574348063,-2.19403846909222,-0.889842271282933,-2.61474632978618,-1.69234485104341,-1.8153691038198,-2.08895671860367,-0.963442673200675,-0.884999511498559,-2.22880635654271,-1.59654299210172,-1.28453107704603,-2.48475859964628,-0.964390074725863,-2.54894250057949,-2.17324264052225,-1.77550323672971,-1.06825909021296,-1.68881671771116,-1.59431161831985,-0.987696273192545,-1.41933601803832,-2.24498675135583,-1.84405517472037,-2.24383518289473,-0.984060356837622,-0.883670337434012],"y":[-1.63161947873313,-0.316230635964758,1.39155013955229,-0.892008258240348,-3.42622400256615,-2.9138681618822,-0.696638055116947,-1.06139957139415,-3.05502694243407,-1.70261486055837,-1.20540373502586,-1.60582182428198,-1.07072350055339,-2.88214605168551,-0.890317158035016,-1.94442850707882,-1.58348021579426,-0.281016877051704,-0.594150176833401,-3.38585906125071,0.318281149285555,-0.892641860264895,-0.811958421987623,-0.352912509060641,-0.638641250561509,-1.49000262475019,-1.9180271490669,-3.55483693114026,-0.528642174006532,-0.975663945116622,-1.50762197251385,-3.06400469824943,-0.304409957886928,-1.31375725425002,-1.41647630412127,-0.100545381962068,0.148396049179851,-0.368200007651646,-0.620625287525359,-3.40977312601429,0.618769728072113,-0.132961122180317,-1.55059353388897,-1.08984250430436,-1.04103889689091,0.518059110679652,-0.695956545890832,-2.22417577072571,-0.56978519840542,-1.47862863234562,-0.8884701947567,0.151001039406995,-0.458400224073027,0.214444908207888,-1.48228452967532,-0.628183606916963,0.903702835061962,-0.768700684751002,-2.96165867766026,-0.120743262366634,-2.81155845483269,-3.11259160754009,-0.978260739854904,-0.59204123329488,-1.19375282529686,-0.435301033689639,-1.41801815909067,-1.37721603013961,-1.61962244811287,-0.832024683457384,-0.577690917899283,-0.905786797366523,-0.757213444627259,-1.97652969899719,-1.46678069334414,0.478689742401934,-1.32387701016881,-0.31761972547059,-1.48776194332388,-1.48813071198135,-1.36300157798071,0.0639080970925194,-3.08582373973908,-0.959934385629038,-1.34990742443717,-0.265898085694732,-1.35782126849696,-2.04910239217287,-0.655932924671176,-0.870932738200651,-1.42693373521706,-0.681466708629906,-1.29428723223246,-1.30886139380752,-3.65552096947324,-0.545636023000749,-0.784408403560482,1.09611115010977,-2.92129637512291,-0.376325857821407,-0.0286684601134749],"z":[-2.67830125388609,-0.118318332391146,1.14456076431397,0.395581541999237,-2.19044315007205,-2.41714527870799,0.248761759238037,-2.2322862486621,-2.52487852839761,0.492691841349714,0.50178051376327,-0.108742722440492,1.11564320007672,-2.56738439668857,0.373382897645395,-3.13732963953692,0.199980435264632,-1.75070065357483,-0.044570681200736,-2.60592022014368,0.306525724505042,-0.143427442897162,0.376364522787893,-1.22681733420988,0.261402174070273,1.67303517582267,0.76807921814998,-2.08774948642519,0.192047008938649,0.493125463750977,0.0496479919457366,-1.98191153785947,0.0764897630022898,-1.96935751757094,0.380518798755228,-0.228388650863833,0.185928188304308,-0.134043122534093,0.0475575976816548,-1.71320446214249,0.0779912592183085,0.702427263641774,0.177659789892266,-0.804424826063794,0.199759341071406,-0.141378703552411,0.159875089234845,-0.589034582542512,-0.157305035467972,0.165786892145642,0.0243771859330854,0.379228303785058,1.4374640559592,0.435478736418127,0.553596590326227,-1.90944343154616,0.188829047315349,0.388581897980018,-2.84109484967655,-0.0950599520200208,-0.38311943851345,-2.49644845900557,0.609073744121918,0.557617887088536,-2.27876144829735,-0.171437554636399,0.669818091552176,0.512213633960729,0.329342054403274,0.323933378518671,0.326921158623243,0.0389592224841209,0.394746954860111,-0.18015575404526,0.400382037165432,-0.284631499584069,-0.895352328137867,-0.13527140686637,0.264509694118803,0.0562635747451358,-2.06314136283494,0.428357654599608,-2.46922149197885,0.677416644500756,0.534132625344637,-0.719816536191311,1.6918862738574,1.19326485106875,0.103163926002359,0.206090284927721,-2.45231321182487,-0.476057615307119,0.872551713846899,0.943319973823917,-1.41529952934896,0.107488149644325,0.236949580528614,0.676174061826858,-0.39351019465626,-0.0987765517316302,0.511344946703657],"mode":"markers","type":"scatter3d","name":"med_high","marker":{"color":"rgba(141,160,203,1)","line":{"color":"rgba(141,160,203,1)"}},"textfont":{"color":"rgba(141,160,203,1)"},"error_y":{"color":"rgba(141,160,203,1)"},"error_x":{"color":"rgba(141,160,203,1)"},"line":{"color":"rgba(141,160,203,1)"},"frame":null},{"x":[6.92479443469034,7.31220653341853,7.05729956670459,7.01239679473167,6.9632540163712,7.16803451344172,7.00912205629887,7.42384607950557,6.88641269240062,7.15030779324755,7.51878633331119,6.81018613456774,7.12556463416392,7.89581295924321,7.09103733325887,7.05007369530946,6.9049696779264,7.43977898909113,7.175626507034,6.87504163695965,6.86404477928609,7.11968867650073,7.07651710978727,6.55243542454266,7.30107952593156,7.64003521293673,7.14265312838844,6.96586686865469,7.01000175547194,7.15444139095287,6.78866188540258,7.16977965964232,7.06121701192771,6.77679803480308,6.55193127198326,6.6124833184634,7.1842720942293,6.93774801369599,6.53011012863931,7.90017918607625,7.05877333053686,6.94079989465125,6.955163803191,7.27198116102046,7.27329989740923,6.94175543028646,7.18643275596512,7.29059328236976,7.16701008181298,6.85843396724164,-0.498613886835177,6.63486756716485,6.76772377981268,6.13185460774129,7.17464181033054,7.14771147433998,6.66147119201692,6.31209886765644,7.24538775220127,7.21827155680063,7.30732041449054,7.05616035008232,7.27400867746413,6.9675382502043,6.60189190013015,7.50390933639098,6.69153012898656,7.12965304166561,7.42437688063055,6.90851053430686,6.66099408073198,6.85837780670015,7.00469315312179,7.63532224060839,7.28794135781978,6.61571507070266,7.35853971855177,7.12356380343593,7.0385583152927,6.09830100726565,7.04899598769931,7.19913061867957,7.02680774481243,6.94088967720957,7.19853446903608,6.45700950945927,7.68242458301653,7.31686577362293,6.69670830412122,6.8912199891701,7.03663762966239,6.7770070571793,7.47894598030886,7.21947179985851,7.1404932355848,6.93347095276068,7.23374105633344,6.9607812804498,7.17407424953949,7.41049727017632,7.30537938203583],"y":[0.95273135535027,0.163752109889677,-0.00486206523405996,0.0552364150840169,0.137799542912149,-0.0664751998866932,-0.0644591108392509,-0.535849038956007,-1.01404272957593,-0.16107325541107,0.116826068540651,0.996156106149138,-0.0699699056052454,-0.168219486338555,0.489371061637893,0.157651741456608,0.540333977734427,0.0778166137344992,0.299059785407223,0.744903143551984,1.16817225979116,0.594135741409087,0.199187799889084,0.519382182302198,0.500212074232967,0.143727203736186,-0.233845486465488,-0.232484030707015,0.389418594650186,-0.216159700352516,0.695100313365572,0.0118272684009419,-0.220228093704684,-0.870450574294164,-0.66318117674463,0.670015197202527,0.121830029645184,0.416349743835017,0.932281181163356,-0.124060890138104,0.271224925600922,0.518704500241918,0.592698987439292,-0.14176202221528,-0.0289823591601381,0.345432021134951,-0.302635325930159,-0.340831325623901,0.23791961727965,0.790040832262816,-3.29897992716342,0.975247611980585,0.567168224046432,1.37421267816224,-0.0641606177322905,-0.320632155874013,0.44117899903265,1.40601847977881,-0.0558648905947088,0.0521376318985789,0.078116683144321,0.18629231025261,0.136642009890951,0.2588174436987,1.0998886549639,-0.515945210047121,-0.530848456369357,0.299675878128054,0.604134779264851,-0.245246418752923,1.41205638014682,0.457488802621755,-0.0651575801508174,-0.531347358778224,-0.520746947604085,-0.814790441427979,0.174054437451666,0.124604027614805,0.0681151679145779,1.27599094125239,0.574734409280235,0.0287278476774065,0.470136848697146,0.113033677025294,-0.120579975717441,1.31655169026713,-0.135602513032693,-0.0809752327792659,0.745879780392878,0.134423865962061,0.1731139099063,0.170763639458406,-0.22128458660508,-0.191667964003606,0.194320643701129,0.471464032265817,0.385039609189856,0.215383999354583,0.667466438070569,-0.235854690289397,-0.01762555240459],"z":[0.173507958930287,-0.220465258333285,-0.645327519019081,-0.57005831289299,-0.491530500808348,0.405861833351719,-0.352113335850177,-0.896018508310778,-0.968447580260685,-0.735815746102854,0.625844373213993,1.47485354908811,-0.285543253636539,1.36006442163347,0.50283358078778,0.361109024686378,0.814362984920942,0.739393246493896,0.226620305388718,0.547068772332324,0.77256793551244,1.40641478688249,0.505041755971474,0.214426665835588,-0.879843097367963,0.586734013041864,-0.601220515464608,-1.67433383552435,0.5275833249839,-0.725937853133073,1.00697264125383,-0.679840520363655,-1.34886692459286,-0.872357272193804,-1.17612327353257,0.17537776251999,-0.262538573168348,0.285224035576357,1.27922644908295,1.27042596728434,-0.0569455450359458,1.3481891356414,1.52704910395582,-1.14282389238233,-0.073845339368248,0.455555284395819,-0.609365983592446,-0.959596841177821,0.63665196141287,0.775354730702775,-1.69205891152528,1.36215636965808,0.975708563373498,1.39526695302502,-1.34482889496996,-1.31322642883527,-0.0833931617270097,1.05801234965139,-0.73815639612998,-0.885251787580986,0.321666883406585,-0.495956319357246,-0.675384518683543,0.0930642007896773,0.524581451555445,-0.93721553363262,-0.983235371956161,-0.131653049055238,0.13147451081421,-0.51072563065901,0.74773465046291,-0.0801122051816521,-0.619148983371202,-0.776189083446346,-0.671978100391584,-1.12870161303502,0.866112547200893,-0.791621925387844,-0.175172879518281,1.30226501834707,0.363153425769134,-0.252916318742727,0.303180426837829,-0.442298451861169,-0.862633994716404,1.25891099977589,1.1384045642256,-0.781454035630539,0.590307003985251,-0.435038777836703,0.103425324076133,-0.674145553320738,0.128683858047499,-0.998057717865307,-0.264800720310152,0.0379549522248451,1.37299997090052,0.348606194383015,1.01963326569246,-0.47611923332658,-0.0601307631117023],"mode":"markers","type":"scatter3d","name":"high","marker":{"color":"rgba(231,138,195,1)","line":{"color":"rgba(231,138,195,1)"}},"textfont":{"color":"rgba(231,138,195,1)"},"error_y":{"color":"rgba(231,138,195,1)"},"error_x":{"color":"rgba(231,138,195,1)"},"line":{"color":"rgba(231,138,195,1)"},"frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script> |
| - Draw another 3D plot where the color is defined by the clusters of the k-means. - How do the plots differ? Are there any similarities? |
| 3D plot based on k-means The second plot needs to be separated by k-means. However, train_4 data (80%) was not used to calculate k-means. Instead, based on the assignments, we needed to use the complete Boston dataset. The standardized datasets are: |
| - Assignment number 7. K-means, creating k-means and graphs, data = boston_scl, variables = km_2, km_3, km_4 - Assignment BONUS. data = boston_bonus, variables = bkm_2. bkm_3 |
| ```r data(“Boston”) boston_sb <- as.data.frame(scale(Boston)) boston_sb\(crim <- as.numeric(boston_sb\)crim) # Create categorical variable bins_sb <- quantile(boston_sb\(crim) crime_sb <- cut(boston_sb\)crim, breaks = bins_sb, labels=c(“low”, “med_low”, “med_high”, “high”), include.lowest = TRUE) |
| boston_sb <- dplyr::select(boston_sb, -crim) boston_sb <- data.frame(boston_sb, crime_sb) boston_sb\(crime_sb <- as.numeric(boston_sb\)crime_sb) sbkm_3 <- kmeans(boston_sb, centers = 3) boston_sb\(crim <- as.numeric(boston_sb\)crim) lda.fit_sb <- lda(sbkm_3$cluster ~ ., data = boston_sb) #LDA using the clusters as target classes ``` |
## Warning in lda.default(x, grouping, ...): variables are collinear |
r lda.fit_sb |
## Call: ## lda(sbkm_3$cluster ~ ., data = boston_sb) ## ## Prior probabilities of groups: ## 1 2 3 ## 0.1897233 0.4762846 0.3339921 ## ## Group means: ## zn indus chas nox rm age ## 1 1.6956974 -1.0817907 -0.067271764 -1.1173803 0.66096776 -1.31662020 ## 2 -0.3337899 -0.3632658 0.021728211 -0.3404473 0.05360582 -0.03985203 ## 3 -0.4872402 1.1325382 0.007228345 1.1202149 -0.45190478 0.80473300 ## dis rad tax ptratio black lstat medv ## 1 1.40553832 -0.6193863 -0.6437607 -0.6631769 0.3585763 -0.9180096 0.7410061 ## 2 0.03865169 -0.5710917 -0.6216579 -0.1578980 0.2878179 -0.2319868 0.1757696 ## 3 -0.85353099 1.1662378 1.2521927 0.6018841 -0.6141269 0.8522943 -0.6715802 ## crime_sb crim ## 1 1.281250 1.281250 ## 2 2.132780 2.132780 ## 3 3.715976 3.715976 ## ## Coefficients of linear discriminants: ## LD1 LD2 ## zn -0.121076625 1.40439032 ## indus 0.815181218 0.37919114 ## chas -0.053348063 0.02575069 ## nox 1.213770032 0.59654110 ## rm -0.013376446 0.16375801 ## age -0.007550386 -0.49308296 ## dis -0.021956895 0.49261234 ## rad 0.630308088 0.29865470 ## tax 0.886895618 0.64969738 ## ptratio 0.330976614 0.15027688 ## black -0.002217404 -0.04048125 ## lstat 0.260179025 0.22374889 ## medv 0.149859384 0.23550344 ## crime_sb -0.054851160 -0.15575154 ## crim -0.054851160 -0.15575154 ## ## Proportion of trace: ## LD1 LD2 ## 0.8426 0.1574 |
r model_predictors_sb <- dplyr::select(boston_sb, -crime_sb) dim(model_predictors_sb) |
## [1] 506 14 |
r dim(lda.fit_sb$scaling) |
## [1] 15 2 |
r boston_sb$crime_sb <- factor(boston_sb$crime_sb, levels = c("low", "med_low", "med_high", "high")) boston_sb$crime_sb <- as.numeric(boston_sb$crime_sb) # matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_sb$scaling |
| The last lines matrix_product_sb <- as.matrix(…) gives an ERROR CODE |
| NOTE. The dollar sign has replaces with & and a star with extra % |
| Error in as.matrix(model_predictors_sb) %%% lda.fit_sb&scaling : non-conformable arguments |
| Install and access plotly and draw 3D plot |
r #plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers') plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = boston_sb$clusters) |
{=html} <div id="htmlwidget-2514ce28ae593bde0f73" style="width:576px;height:384px;" class="plotly html-widget"></div> <script type="application/json" data-for="htmlwidget-2514ce28ae593bde0f73">{"x":{"visdat":{"d847cec1447":["function () ","plotlyVisDat"]},"cur_data":"d847cec1447","attrs":{"d847cec1447":{"x":[6.92479443469034,-1.76929243990097,7.31220653341853,7.05729956670459,7.01239679473167,-1.95300878786114,-3.04841437436667,6.9632540163712,7.16803451344172,-1.16210438353346,6.29757380890646,7.00912205629887,-1.08839978854359,-1.9310694082109,7.42384607950557,6.88641269240062,7.15030779324755,-1.89969230472143,-1.37956264920308,-0.989970941197395,7.51878633331119,6.81018613456774,-0.800836457735869,-3.50564733640533,-3.15334760238319,-3.07343024078921,-2.28506433987314,-2.47024713908725,7.12556463416392,-3.35560154743073,7.89581295924321,-2.17056341670846,-3.20358425806132,-3.09977195483959,7.09103733325887,-1.85800114642165,7.05007369530946,-0.648129036619982,6.9049696779264,-2.10302531579229,-2.6118753266799,7.43977898909113,7.175626507034,-3.25795106792961,-2.2610502375321,-1.6125577751924,6.87504163695965,-1.64198867009458,6.86404477928609,-2.50395081966598,7.11968867650073,7.07651710978727,-1.63241099348691,-2.49168073685727,-3.03881013996222,-2.71601224464813,6.55243542454266,7.30107952593156,-1.81673256933956,7.64003521293673,-1.78599869600627,-2.4578748014512,-3.85512486175803,-3.72233042095896,-2.42718226749483,-0.941322736574876,-3.212232255526,-3.52868231175967,-2.70031643942833,-1.44640071598716,7.14265312838844,-2.36485110075973,-0.640760921223522,-3.10028849355645,-2.17347087081451,-1.88147326149436,6.96586686865469,-3.17424990011337,7.01000175547194,-0.81933243035809,-2.88406649295359,-1.52969913840919,-2.77530055858516,7.15444139095287,6.78866188540258,-1.94874727092084,7.16977965964232,7.06121701192771,-2.31695406008082,-2.43656131050479,-3.49425384488999,-1.30300725095014,-2.66165237770623,-3.0016395930851,-3.79597577011349,6.77679803480308,-2.68057546523544,-2.42786817522526,-4.28729939548547,-3.41345372797971,6.55193127198326,-1.59097002826539,-3.20255337606677,-3.21769624353851,-1.39805908019807,-1.9327842601142,-2.38273403846818,-3.49382869839315,-1.4916082060715,-3.62793863217678,-2.40968301068152,6.6124833184634,7.1842720942293,-2.89218861409356,-4.65468403047868,-1.34933645901058,-3.22748668306813,-3.79591381229572,6.93774801369599,6.53011012863931,-3.41632597476241,-1.77418749068505,7.90017918607625,-1.656645839113,-3.55191078826805,5.88580671576082,-2.97242969511474,-3.14639492754784,-3.18040835126642,7.05877333053686,-1.36600095798046,-3.67564856196909,6.94079989465125,6.955163803191,-4.0757981427668,7.27198116102046,-2.59226420200108,-1.47660291334234,-3.56946063720633,-2.66095079556534,-2.42441513787009,-3.06471170072102,-2.16247073532,-1.7991527379547,7.27329989740923,-2.21732083868514,-3.48087174227998,-2.34816787498006,-3.26805237499614,6.94175543028646,-2.07878816422805,7.18643275596512,-4.0419678820709,-1.59548511419486,-2.59113595140755,7.29059328236976,-0.389523142087116,-2.47888478754301,-3.26333486269634,-2.04912424827551,7.16701008181298,-2.96843542443139,-3.65776631571092,-1.81543314607997,-0.681903041947926,-3.36359218187229,6.85843396724164,-0.904599648536334,-1.64839893202334,-0.498613886835177,-1.71497704324788,-3.28171463127653,-1.42968490974296,-1.62044761875161,-2.72590347586785,-2.04436416131562,6.63486756716485,-3.85805946757369,-2.48252012726116,-1.50263274147134,-2.55898416233392,-2.67686927525315,-3.34993671972359,-1.09281184147703,6.76772377981268,6.13185460774129,-2.52534772859105,7.17464181033054,-0.367293508330506,7.14771147433998,-3.18715033281531,-4.58865491193033,-2.47041877409088,-2.16491349071616,-1.73502819449116,-4.3726814636257,-0.977241338607186,6.66147119201692,-1.66856877885729,6.31209886765644,7.24538775220127,-2.09499020015907,-2.97751185252054,-1.97687917381721,-1.38952183648591,-2.3087527281106,7.21827155680063,7.30732041449054,7.05616035008232,7.27400867746413,-1.70018626957074,-2.67086861704607,-1.78281844986233,-2.48096759624409,-2.34638003456882,-2.96874522287639,-3.02360940839129,-3.72921291385981,-2.53199571264683,-2.08400915404304,-2.78885502191171,6.9675382502043,6.60189190013015,-2.23490986656451,-1.68789992212259,-2.47592766998599,-2.90044582513607,7.50390933639098,-2.33390625424615,-3.50982745433753,-3.07155219338769,-1.62719074620207,-2.22016215655021,-3.49170183271469,-1.45337744114334,-3.37770703901017,-2.35159406585056,6.61158259553044,-3.89329673941975,-2.26369590757651,-2.10172375901644,-1.3847536974209,-2.60261654334018,-1.10377577401655,-3.82797869180787,-2.7405572410435,-3.43775224599056,6.69153012898656,-2.17675132866371,-4.44424834042072,-2.4614896316362,-2.23326227853333,-1.84007261905849,-3.51921281474623,7.12965304166561,-2.09033145947521,-4.23154624704717,7.42437688063055,-0.7089096807714,-1.99547244683633,-2.81529998207741,-2.39841240569013,-2.79444490456042,-3.82164087170313,-2.56134486395365,-2.25747692053,-3.62888229503155,-3.53686031576448,-4.65628588669786,6.90851053430686,-1.80330924664602,6.66099408073198,-2.45967379293124,-1.87785904082196,-2.45987547993517,-1.89508685124305,-2.35970284837227,-2.98759661868615,6.85837780670015,-3.6281285782464,-3.94609855264269,-1.5075966009686,7.00469315312179,-1.69982540836452,-1.89688561467739,-3.50067058484791,-3.55251624439046,-1.41948259592804,-2.4320780178518,7.63532224060839,7.28794135781978,-3.01906843440461,-1.24571308956746,-1.63044963584432,-1.6005521880762,-2.21051114816131,-3.41835963015704,-1.89113442156731,-1.32631804467857,-2.4633576506278,-2.54716417706923,-3.35142730338706,-2.46498219373519,-2.20737706806954,-3.53355567680532,-1.40262562072168,6.61571507070266,-1.63731093154936,-1.33888601261769,-1.96760643230707,-3.40230915582097,-3.42820207034945,-1.31328811247307,-1.5944574348063,7.35853971855177,-2.89293807071585,7.12356380343593,7.0385583152927,6.09830100726565,-4.60333442050344,-1.72240911590473,-2.1305579749373,7.04899598769931,7.19913061867957,7.02680774481243,-2.99070233077692,6.94088967720957,-2.19403846909222,-0.889842271282933,7.19853446903608,-2.61474632978618,6.45700950945927,-1.69234485104341,7.68242458301653,-3.03977941225215,7.31686577362293,-1.8153691038198,6.69670830412122,-2.96399238057031,6.8912199891701,-2.08895671860367,-4.40122062595089,-1.75782957905972,-0.963442673200675,7.03663762966239,-1.56373472013486,-3.55064856092184,6.7770070571793,-0.884999511498559,-2.383924969079,-2.22880635654271,7.47894598030886,-1.54475395877035,-1.59654299210172,-1.54167188267037,-2.63675394816547,-3.19987440697255,-1.28453107704603,7.21947179985851,-2.00712470209233,-4.77156771469356,-1.78308228072063,-1.35612078244121,-2.84093043181402,-2.48475859964628,7.1404932355848,-0.964390074725863,-2.54894250057949,-2.17324264052225,-1.77550323672971,6.93347095276068,7.23374105633344,-1.06825909021296,-3.41705069379898,-1.68881671771116,-3.71231980955485,-1.59431161831985,-3.2629897049185,-2.98747198559618,-2.6843197111964,-3.89291801477597,-2.94682488491055,-2.45504840445251,6.9607812804498,-4.7289105084523,-0.987696273192545,-4.56550018576793,-4.16175735082408,7.17407424953949,7.41049727017632,-1.41933601803832,7.30537938203583,-2.52205090078371,-2.24498675135583,-3.06893750097301,-2.7966113972952,-1.84405517472037,-3.11144871076551,-2.71828637242305,-2.24383518289473,-0.984060356837622,-2.57577168741856,-0.883670337434012,-4.59796624201324],"y":[0.95273135535027,-1.63161947873313,0.163752109889677,-0.00486206523405996,0.0552364150840169,-0.509270010292348,-1.01747002363211,0.137799542912149,-0.0664751998866932,-0.316230635964758,1.39155013955229,-0.0644591108392509,-0.892008258240348,-1.31960636056619,-0.535849038956007,-1.01404272957593,-0.16107325541107,1.57123842226219,-3.42622400256615,0.984711076125353,0.116826068540651,0.996156106149138,-2.9138681618822,-0.229127755229563,2.97877898573708,0.150411970282781,-0.110716939922152,-1.72881529948432,-0.0699699056052454,0.0727298909403923,-0.168219486338555,-0.696638055116947,0.72424921171105,-0.170395849593443,0.489371061637893,-1.06139957139415,0.157651741456608,-3.05502694243407,0.540333977734427,-1.60932508029201,0.405512807024751,0.0778166137344992,0.299059785407223,2.56451356931623,-1.70261486055837,-0.264812524082603,0.744903143551984,-0.42158291122717,1.16817225979116,0.556384379774058,0.594135741409087,0.199187799889084,-0.657741793826337,-0.469693295695982,0.607711239727536,-1.34970243071556,0.519382182302198,0.500212074232967,-1.20540373502586,0.143727203736186,-1.60582182428198,0.204587979823946,-0.74176080884859,-0.311838972214075,0.214056959387655,-1.07072350055339,2.34309266668541,2.52948258553575,1.50860887244057,-0.523117955502722,-0.233845486465488,0.0774411436839476,-2.88214605168551,2.9229257929761,-0.890317158035016,-1.94442850707882,-0.232484030707015,-0.278220093936303,0.389418594650186,0.712820143566962,1.13803817391935,-1.58348021579426,3.01721771413738,-0.216159700352516,0.695100313365572,-0.281016877051704,0.0118272684009419,-0.220228093704684,0.0130678101426698,-0.594150176833401,-0.03540997553985,-3.38585906125071,-1.46365684682358,2.16570067167416,-0.522835584234477,-0.870450574294164,0.318281149285555,-0.182250029495606,-0.163517343185963,0.190208028085958,-0.66318117674463,0.666055177897204,-1.21664647853885,1.1032955049857,-0.596758492985543,-1.13947102601972,-0.892641860264895,-0.610227979190995,0.813734947532937,-1.14509539678995,2.09911987372579,0.670015197202527,0.121830029645184,0.880595811567945,0.962783988416197,-0.811958421987623,2.79750487375806,-0.517252910741722,0.416349743835017,0.932281181163356,0.98804498844771,-0.549396183654916,-0.124060890138104,-0.340650581009671,-0.0420397739710998,-0.352912509060641,1.38616346697506,2.8948719831721,1.25351809893775,0.271224925600922,-0.67685183075534,0.0879991986133098,0.518704500241918,0.592698987439292,2.63964829756068,-0.14176202221528,0.077745970293495,-0.638641250561509,0.836506016635764,0.107149624155406,-1.49000262475019,2.97226268482977,-1.9180271490669,-1.17740821829981,-0.0289823591601381,0.0401132425114254,0.0421879948146553,-0.0950869536297786,0.403070177813424,0.345432021134951,-0.64716536923915,-0.302635325930159,-0.562269220561461,-1.3420259621234,-1.56036792805585,-0.340831325623901,-3.55483693114026,-0.528642174006532,-0.583358400319553,-0.975663945116622,0.23791961727965,0.77269495321991,-0.259046620790837,-1.50762197251385,-3.06400469824943,0.850375190189461,0.790040832262816,-0.304409957886928,-0.350936014788153,-3.29897992716342,-1.31375725425002,1.03684956122087,1.39600472433571,-1.41647630412127,0.813596382686032,1.50949067451351,0.975247611980585,-0.485649699269625,-0.100545381962068,-0.346970797628606,0.148396049179851,1.54696403427131,1.13260205264322,-0.368200007651646,0.567168224046432,1.37421267816224,-0.620625287525359,-0.0641606177322905,-3.40977312601429,-0.320632155874013,0.618769728072113,0.369780686167431,0.0299265475184135,-0.132961122180317,-1.55059353388897,2.16762568126438,-1.08984250430436,0.44117899903265,-0.60690502297341,1.40601847977881,-0.0558648905947088,-0.155108416515081,2.36337590724161,-1.04103889689091,0.518059110679652,-0.695956545890832,0.0521376318985789,0.078116683144321,0.18629231025261,0.136642009890951,-2.22417577072571,-0.56978519840542,-1.47862863234562,-1.65264419023098,-0.8884701947567,2.49032523875552,-0.787184333678784,-0.567196352749248,0.151001039406995,1.06044750006966,1.34172191885211,0.2588174436987,1.0998886549639,-0.458400224073027,-0.583386808998892,0.214444908207888,0.242067877021668,-0.515945210047121,-0.779769540700784,-0.0347285120027766,-1.21380951286178,-1.48228452967532,-0.394551402180593,0.572217082140371,-0.544042447345757,0.399599158381443,-0.628183606916963,0.903702835061962,-0.134484410832157,1.03242728272474,-0.768700684751002,-0.629645458780947,0.0661797471286705,-2.96165867766026,0.538057489997317,-0.120743262366634,3.22485622034421,-0.530848456369357,-2.81155845483269,1.2001966392275,-1.05255960174706,-1.96219573714475,-0.44712103410027,2.37738723496202,0.299675878128054,1.27289094783724,1.94238717011904,0.604134779264851,-3.11259160754009,-0.978260739854904,-0.266851782742031,1.35387865414653,0.445517650101181,2.99378355501725,1.67702282264461,-0.279217091305749,-1.21278775094862,0.412360910980309,2.65998220619301,-0.245246418752923,-0.59204123329488,1.41205638014682,0.118759521738216,-1.19375282529686,0.0216178333315382,-0.444594508120712,-0.500284747038051,-0.645894703827276,0.457488802621755,2.46759199002379,0.483788900445484,-0.435301033689639,-0.0651575801508174,-1.41801815909067,-0.549590905473963,2.25187894304127,0.991251104919139,1.1060397999793,2.25097197891882,-0.531347358778224,-0.520746947604085,0.683339221165493,1.35171860738546,-1.37721603013961,-1.61962244811287,-0.832024683457384,0.431973079138501,-0.437906979046975,-1.62958135646003,0.0510848331158283,-0.577690917899283,0.566155182246301,-0.905786797366523,1.04921456606211,0.91691569799784,-0.757213444627259,-0.814790441427979,-1.97652969899719,0.583888564220603,1.6818609861465,0.828806308280536,1.04175765571787,0.931521490646662,-1.46678069334414,0.174054437451666,1.3160079324747,0.124604027614805,0.0681151679145779,1.27599094125239,1.30126138254228,-1.33115393552319,1.5089603947676,0.574734409280235,0.0287278476774065,0.470136848697146,-0.959566860473827,0.113033677025294,0.478689742401934,-1.32387701016881,-0.120579975717441,-0.31761972547059,1.31655169026713,-1.48776194332388,-0.135602513032693,1.26341204018124,-0.0809752327792659,-1.48813071198135,0.745879780392878,-0.678250048776581,0.134423865962061,-1.36300157798071,2.38883673615715,0.864837256929574,0.0639080970925194,0.1731139099063,0.915148184038715,0.169290730167617,0.170763639458406,-3.08582373973908,0.133247857509048,-0.959934385629038,-0.22128458660508,-1.52082875458356,-1.34990742443717,-1.20255119397862,0.614831253687175,2.64282223604258,-0.265898085694732,-0.191667964003606,-0.618164259357733,2.01679274471469,-1.41924115894895,-0.784810027149217,2.49643494262723,-1.35782126849696,0.194320643701129,-2.04910239217287,-0.655932924671176,-0.870932738200651,-1.42693373521706,0.471464032265817,0.385039609189856,-0.681466708629906,0.197709014272482,-1.29428723223246,-0.503830963631595,-1.30886139380752,0.0347185653102332,1.36556894228294,1.79030779134942,-0.864153152632234,1.54825064708573,0.316645962656802,0.215383999354583,2.04484281878219,-3.65552096947324,2.44060686826433,2.58528360984114,0.667466438070569,-0.235854690289397,-0.545636023000749,-0.01762555240459,0.137295685413293,-0.784408403560482,1.14377535205525,0.754401605184899,1.09611115010977,1.70582377616065,-1.29444186457983,-2.92129637512291,-0.376325857821407,-0.23081167408616,-0.0286684601134749,1.07393427846073],"z":[0.173507958930287,-2.67830125388609,-0.220465258333285,-0.645327519019081,-0.57005831289299,0.616459896554878,-0.780741438285653,-0.491530500808348,0.405861833351719,-0.118318332391146,1.14456076431397,-0.352113335850177,0.395581541999237,1.9614299409078,-0.896018508310778,-0.968447580260685,-0.735815746102854,0.440081721027855,-2.19044315007205,0.267032309685373,0.625844373213993,1.47485354908811,-2.41714527870799,0.200591413016203,-1.53826730966923,2.1869371353426,-0.0553184856077445,0.272016082888696,-0.285543253636539,0.974078545887799,1.36006442163347,0.248761759238037,1.83492736498382,1.88124598226793,0.50283358078778,-2.2322862486621,0.361109024686378,-2.52487852839761,0.814362984920942,0.967377397144998,1.99609842941219,0.739393246493896,0.226620305388718,-1.87859873368566,0.492691841349714,0.912820904053374,0.547068772332324,0.7463435826331,0.77256793551244,0.303492504960537,1.40641478688249,0.505041755971474,0.71010392171972,0.475840912745016,0.665700635419208,0.11387958225602,0.214426665835588,-0.879843097367963,0.50178051376327,0.586734013041864,-0.108742722440492,-0.300500487651191,0.170230269015476,0.31652785013154,0.191940045053133,1.11564320007672,-0.45740239013973,-2.31288190650569,-0.287147916547337,0.818010116778938,-0.601220515464608,-0.149867467615467,-2.56738439668857,-2.38607093658286,0.373382897645395,-3.13732963953692,-1.67433383552435,1.57710233882142,0.5275833249839,0.32342968372622,-0.0288044219843346,0.199980435264632,-2.54765039429646,-0.725937853133073,1.00697264125383,-1.75070065357483,-0.679840520363655,-1.34886692459286,2.12656773566896,-0.044570681200736,0.527450638446342,-2.60592022014368,0.12639169226497,-3.39928856034585,-0.609449274097644,-0.872357272193804,0.306525724505042,0.188329811411534,-0.69597604223603,1.34941173263846,-1.17612327353257,0.818260047824363,-1.21727272616684,-0.114573363659388,0.104737929362167,0.176254734386535,-0.143427442897162,-0.0967688322157606,1.05749836429055,-0.842202892103678,-0.16015517422689,0.17537776251999,-0.262538573168348,1.80394127634439,-0.432301705524405,0.376364522787893,-1.61239967546788,0.130377243517435,0.285224035576357,1.27922644908295,1.63339175345795,-0.232258115918122,1.27042596728434,0.710710711647655,1.87127095407737,-1.22681733420988,-0.82233733903721,-0.449844270866508,0.0865813449042116,-0.0569455450359458,1.00343619286468,1.79849771973382,1.3481891356414,1.52704910395582,-1.09109518626277,-1.14282389238233,-0.279723703318135,0.261402174070273,-0.85424973509364,-0.0721220214656424,1.67303517582267,-2.28838594223054,0.76807921814998,0.711020594340852,-0.073845339368248,0.348852563975344,0.790192739103407,0.637221061932761,2.017267569797,0.455555284395819,0.910379794724714,-0.609365983592446,-0.145354285597069,2.71784417258831,0.119016654491994,-0.959596841177821,-2.08774948642519,0.192047008938649,-0.740455225861264,0.493125463750977,0.63665196141287,0.120388770609145,-0.105287451590222,0.0496479919457366,-1.98191153785947,1.64816376726816,0.775354730702775,0.0764897630022898,0.764322846738228,-1.69205891152528,-1.96935751757094,0.530059937994777,-0.156437526309414,0.380518798755228,0.0978778304784699,-0.141064747991833,1.36215636965808,-0.036695102225893,-0.228388650863833,-0.121711782565479,0.185928188304308,-1.57984008463282,1.41609125240119,-0.134043122534093,0.975708563373498,1.39526695302502,0.0475575976816548,-1.34482889496996,-1.71320446214249,-1.31322642883527,0.0779912592183085,-0.360225559824714,0.414110260765194,0.702427263641774,0.177659789892266,-1.11592659909136,-0.804424826063794,-0.0833931617270097,0.611937480843653,1.05801234965139,-0.73815639612998,0.388238221322894,-0.473047849805015,0.199759341071406,-0.141378703552411,0.159875089234845,-0.885251787580986,0.321666883406585,-0.495956319357246,-0.675384518683543,-0.589034582542512,-0.157305035467972,0.165786892145642,0.314701480340146,0.0243771859330854,-2.66443508959459,-0.497691199364828,-0.392068674893234,0.379228303785058,0.739568100736119,-0.403325437693944,0.0930642007896773,0.524581451555445,1.4374640559592,0.564589765599985,0.435478736418127,0.785881386714411,-0.93721553363262,0.545751291311456,0.282239787848517,-0.801531259477206,0.553596590326227,0.741598138586343,1.50077506657354,-0.0956502307652168,1.729303269912,-1.90944343154616,0.188829047315349,0.517800163625527,0.52934637014206,0.388581897980018,0.909198996491827,-0.366244799338639,-2.84109484967655,0.819426902753836,-0.0950599520200208,-1.08788805002573,-0.983235371956161,-0.38311943851345,0.700724401390981,0.727364321838376,0.654880334792198,0.436197058725477,0.0897586304627224,-0.131653049055238,0.414914626896442,-1.05463443185915,0.13147451081421,-2.49644845900557,0.609073744121918,2.55419245082415,0.331067360290696,0.185651676731684,-1.99100767375327,0.267780465981547,0.522090539848162,-0.96380426393817,1.22393746498114,-0.542138510925835,-0.51072563065901,0.557617887088536,0.74773465046291,2.20113962017105,-2.27876144829735,0.256838794672639,0.504178013274367,-0.0467892643493847,-0.446174164479694,-0.0801122051816521,-2.58268308530692,0.669149845668088,-0.171437554636399,-0.619148983371202,0.669818091552176,0.386148670537737,-2.58069168356235,-0.350226597552452,-1.39804681453274,-0.111689292396388,-0.776189083446346,-0.671978100391584,0.859095980692473,0.0236570998043008,0.512213633960729,0.329342054403274,0.323933378518671,-0.833400878725531,0.406935102557054,2.91386107132016,0.271350294736458,0.326921158623243,1.77947116470242,0.0389592224841209,0.606989635913978,-0.906171994143057,0.394746954860111,-1.12870161303502,-0.18015575404526,1.2969049568535,0.252629948150208,1.58341556709588,0.293101390377901,-0.141371767341595,0.400382037165432,0.866112547200893,-0.335672756755833,-0.791621925387844,-0.175172879518281,1.30226501834707,0.345279162085749,2.33705201857648,0.433919282224716,0.363153425769134,-0.252916318742727,0.303180426837829,1.08678576850982,-0.442298451861169,-0.284631499584069,-0.895352328137867,-0.862633994716404,-0.13527140686637,1.25891099977589,0.264509694118803,1.1384045642256,-1.10243344389477,-0.781454035630539,0.0562635747451358,0.590307003985251,-0.0655934899482685,-0.435038777836703,-2.06314136283494,-1.11132926739008,0.607318503153174,0.428357654599608,0.103425324076133,-1.73438082656397,1.13733340595552,-0.674145553320738,-2.46922149197885,-0.188169674457099,0.677416644500756,0.128683858047499,0.60110965316707,0.534132625344637,1.51102828882541,2.10293059107999,-0.733414059708051,-0.719816536191311,-0.998057717865307,0.0339689312696631,-0.855968238971975,2.31661345095434,1.04134240646336,-2.30003790859397,1.6918862738574,-0.264800720310152,1.19326485106875,0.103163926002359,0.206090284927721,-2.45231321182487,0.0379549522248451,1.37299997090052,-0.476057615307119,0.741978867550185,0.872551713846899,0.867737142210359,0.943319973823917,1.76918116412115,-0.40890518981598,0.285087431850877,0.0321258620935078,-0.109993937693549,0.280002950690133,0.348606194383015,-0.82491083333247,-1.41529952934896,-1.73261930543094,-1.58450369797042,1.01963326569246,-0.47611923332658,0.107488149644325,-0.0601307631117023,-0.234906379990687,0.236949580528614,0.189162694757061,-0.276837508578492,0.676174061826858,-2.76303724606631,1.30011096481746,-0.39351019465626,-0.0987765517316302,-0.684213155511586,0.511344946703657,-0.245890598027562],"mode":"markers","alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"scatter3d"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"scene":{"xaxis":{"title":[]},"yaxis":{"title":[]},"zaxis":{"title":[]}},"hovermode":"closest","showlegend":false},"source":"A","config":{"modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"data":[{"x":[6.92479443469034,-1.76929243990097,7.31220653341853,7.05729956670459,7.01239679473167,-1.95300878786114,-3.04841437436667,6.9632540163712,7.16803451344172,-1.16210438353346,6.29757380890646,7.00912205629887,-1.08839978854359,-1.9310694082109,7.42384607950557,6.88641269240062,7.15030779324755,-1.89969230472143,-1.37956264920308,-0.989970941197395,7.51878633331119,6.81018613456774,-0.800836457735869,-3.50564733640533,-3.15334760238319,-3.07343024078921,-2.28506433987314,-2.47024713908725,7.12556463416392,-3.35560154743073,7.89581295924321,-2.17056341670846,-3.20358425806132,-3.09977195483959,7.09103733325887,-1.85800114642165,7.05007369530946,-0.648129036619982,6.9049696779264,-2.10302531579229,-2.6118753266799,7.43977898909113,7.175626507034,-3.25795106792961,-2.2610502375321,-1.6125577751924,6.87504163695965,-1.64198867009458,6.86404477928609,-2.50395081966598,7.11968867650073,7.07651710978727,-1.63241099348691,-2.49168073685727,-3.03881013996222,-2.71601224464813,6.55243542454266,7.30107952593156,-1.81673256933956,7.64003521293673,-1.78599869600627,-2.4578748014512,-3.85512486175803,-3.72233042095896,-2.42718226749483,-0.941322736574876,-3.212232255526,-3.52868231175967,-2.70031643942833,-1.44640071598716,7.14265312838844,-2.36485110075973,-0.640760921223522,-3.10028849355645,-2.17347087081451,-1.88147326149436,6.96586686865469,-3.17424990011337,7.01000175547194,-0.81933243035809,-2.88406649295359,-1.52969913840919,-2.77530055858516,7.15444139095287,6.78866188540258,-1.94874727092084,7.16977965964232,7.06121701192771,-2.31695406008082,-2.43656131050479,-3.49425384488999,-1.30300725095014,-2.66165237770623,-3.0016395930851,-3.79597577011349,6.77679803480308,-2.68057546523544,-2.42786817522526,-4.28729939548547,-3.41345372797971,6.55193127198326,-1.59097002826539,-3.20255337606677,-3.21769624353851,-1.39805908019807,-1.9327842601142,-2.38273403846818,-3.49382869839315,-1.4916082060715,-3.62793863217678,-2.40968301068152,6.6124833184634,7.1842720942293,-2.89218861409356,-4.65468403047868,-1.34933645901058,-3.22748668306813,-3.79591381229572,6.93774801369599,6.53011012863931,-3.41632597476241,-1.77418749068505,7.90017918607625,-1.656645839113,-3.55191078826805,5.88580671576082,-2.97242969511474,-3.14639492754784,-3.18040835126642,7.05877333053686,-1.36600095798046,-3.67564856196909,6.94079989465125,6.955163803191,-4.0757981427668,7.27198116102046,-2.59226420200108,-1.47660291334234,-3.56946063720633,-2.66095079556534,-2.42441513787009,-3.06471170072102,-2.16247073532,-1.7991527379547,7.27329989740923,-2.21732083868514,-3.48087174227998,-2.34816787498006,-3.26805237499614,6.94175543028646,-2.07878816422805,7.18643275596512,-4.0419678820709,-1.59548511419486,-2.59113595140755,7.29059328236976,-0.389523142087116,-2.47888478754301,-3.26333486269634,-2.04912424827551,7.16701008181298,-2.96843542443139,-3.65776631571092,-1.81543314607997,-0.681903041947926,-3.36359218187229,6.85843396724164,-0.904599648536334,-1.64839893202334,-0.498613886835177,-1.71497704324788,-3.28171463127653,-1.42968490974296,-1.62044761875161,-2.72590347586785,-2.04436416131562,6.63486756716485,-3.85805946757369,-2.48252012726116,-1.50263274147134,-2.55898416233392,-2.67686927525315,-3.34993671972359,-1.09281184147703,6.76772377981268,6.13185460774129,-2.52534772859105,7.17464181033054,-0.367293508330506,7.14771147433998,-3.18715033281531,-4.58865491193033,-2.47041877409088,-2.16491349071616,-1.73502819449116,-4.3726814636257,-0.977241338607186,6.66147119201692,-1.66856877885729,6.31209886765644,7.24538775220127,-2.09499020015907,-2.97751185252054,-1.97687917381721,-1.38952183648591,-2.3087527281106,7.21827155680063,7.30732041449054,7.05616035008232,7.27400867746413,-1.70018626957074,-2.67086861704607,-1.78281844986233,-2.48096759624409,-2.34638003456882,-2.96874522287639,-3.02360940839129,-3.72921291385981,-2.53199571264683,-2.08400915404304,-2.78885502191171,6.9675382502043,6.60189190013015,-2.23490986656451,-1.68789992212259,-2.47592766998599,-2.90044582513607,7.50390933639098,-2.33390625424615,-3.50982745433753,-3.07155219338769,-1.62719074620207,-2.22016215655021,-3.49170183271469,-1.45337744114334,-3.37770703901017,-2.35159406585056,6.61158259553044,-3.89329673941975,-2.26369590757651,-2.10172375901644,-1.3847536974209,-2.60261654334018,-1.10377577401655,-3.82797869180787,-2.7405572410435,-3.43775224599056,6.69153012898656,-2.17675132866371,-4.44424834042072,-2.4614896316362,-2.23326227853333,-1.84007261905849,-3.51921281474623,7.12965304166561,-2.09033145947521,-4.23154624704717,7.42437688063055,-0.7089096807714,-1.99547244683633,-2.81529998207741,-2.39841240569013,-2.79444490456042,-3.82164087170313,-2.56134486395365,-2.25747692053,-3.62888229503155,-3.53686031576448,-4.65628588669786,6.90851053430686,-1.80330924664602,6.66099408073198,-2.45967379293124,-1.87785904082196,-2.45987547993517,-1.89508685124305,-2.35970284837227,-2.98759661868615,6.85837780670015,-3.6281285782464,-3.94609855264269,-1.5075966009686,7.00469315312179,-1.69982540836452,-1.89688561467739,-3.50067058484791,-3.55251624439046,-1.41948259592804,-2.4320780178518,7.63532224060839,7.28794135781978,-3.01906843440461,-1.24571308956746,-1.63044963584432,-1.6005521880762,-2.21051114816131,-3.41835963015704,-1.89113442156731,-1.32631804467857,-2.4633576506278,-2.54716417706923,-3.35142730338706,-2.46498219373519,-2.20737706806954,-3.53355567680532,-1.40262562072168,6.61571507070266,-1.63731093154936,-1.33888601261769,-1.96760643230707,-3.40230915582097,-3.42820207034945,-1.31328811247307,-1.5944574348063,7.35853971855177,-2.89293807071585,7.12356380343593,7.0385583152927,6.09830100726565,-4.60333442050344,-1.72240911590473,-2.1305579749373,7.04899598769931,7.19913061867957,7.02680774481243,-2.99070233077692,6.94088967720957,-2.19403846909222,-0.889842271282933,7.19853446903608,-2.61474632978618,6.45700950945927,-1.69234485104341,7.68242458301653,-3.03977941225215,7.31686577362293,-1.8153691038198,6.69670830412122,-2.96399238057031,6.8912199891701,-2.08895671860367,-4.40122062595089,-1.75782957905972,-0.963442673200675,7.03663762966239,-1.56373472013486,-3.55064856092184,6.7770070571793,-0.884999511498559,-2.383924969079,-2.22880635654271,7.47894598030886,-1.54475395877035,-1.59654299210172,-1.54167188267037,-2.63675394816547,-3.19987440697255,-1.28453107704603,7.21947179985851,-2.00712470209233,-4.77156771469356,-1.78308228072063,-1.35612078244121,-2.84093043181402,-2.48475859964628,7.1404932355848,-0.964390074725863,-2.54894250057949,-2.17324264052225,-1.77550323672971,6.93347095276068,7.23374105633344,-1.06825909021296,-3.41705069379898,-1.68881671771116,-3.71231980955485,-1.59431161831985,-3.2629897049185,-2.98747198559618,-2.6843197111964,-3.89291801477597,-2.94682488491055,-2.45504840445251,6.9607812804498,-4.7289105084523,-0.987696273192545,-4.56550018576793,-4.16175735082408,7.17407424953949,7.41049727017632,-1.41933601803832,7.30537938203583,-2.52205090078371,-2.24498675135583,-3.06893750097301,-2.7966113972952,-1.84405517472037,-3.11144871076551,-2.71828637242305,-2.24383518289473,-0.984060356837622,-2.57577168741856,-0.883670337434012,-4.59796624201324],"y":[0.95273135535027,-1.63161947873313,0.163752109889677,-0.00486206523405996,0.0552364150840169,-0.509270010292348,-1.01747002363211,0.137799542912149,-0.0664751998866932,-0.316230635964758,1.39155013955229,-0.0644591108392509,-0.892008258240348,-1.31960636056619,-0.535849038956007,-1.01404272957593,-0.16107325541107,1.57123842226219,-3.42622400256615,0.984711076125353,0.116826068540651,0.996156106149138,-2.9138681618822,-0.229127755229563,2.97877898573708,0.150411970282781,-0.110716939922152,-1.72881529948432,-0.0699699056052454,0.0727298909403923,-0.168219486338555,-0.696638055116947,0.72424921171105,-0.170395849593443,0.489371061637893,-1.06139957139415,0.157651741456608,-3.05502694243407,0.540333977734427,-1.60932508029201,0.405512807024751,0.0778166137344992,0.299059785407223,2.56451356931623,-1.70261486055837,-0.264812524082603,0.744903143551984,-0.42158291122717,1.16817225979116,0.556384379774058,0.594135741409087,0.199187799889084,-0.657741793826337,-0.469693295695982,0.607711239727536,-1.34970243071556,0.519382182302198,0.500212074232967,-1.20540373502586,0.143727203736186,-1.60582182428198,0.204587979823946,-0.74176080884859,-0.311838972214075,0.214056959387655,-1.07072350055339,2.34309266668541,2.52948258553575,1.50860887244057,-0.523117955502722,-0.233845486465488,0.0774411436839476,-2.88214605168551,2.9229257929761,-0.890317158035016,-1.94442850707882,-0.232484030707015,-0.278220093936303,0.389418594650186,0.712820143566962,1.13803817391935,-1.58348021579426,3.01721771413738,-0.216159700352516,0.695100313365572,-0.281016877051704,0.0118272684009419,-0.220228093704684,0.0130678101426698,-0.594150176833401,-0.03540997553985,-3.38585906125071,-1.46365684682358,2.16570067167416,-0.522835584234477,-0.870450574294164,0.318281149285555,-0.182250029495606,-0.163517343185963,0.190208028085958,-0.66318117674463,0.666055177897204,-1.21664647853885,1.1032955049857,-0.596758492985543,-1.13947102601972,-0.892641860264895,-0.610227979190995,0.813734947532937,-1.14509539678995,2.09911987372579,0.670015197202527,0.121830029645184,0.880595811567945,0.962783988416197,-0.811958421987623,2.79750487375806,-0.517252910741722,0.416349743835017,0.932281181163356,0.98804498844771,-0.549396183654916,-0.124060890138104,-0.340650581009671,-0.0420397739710998,-0.352912509060641,1.38616346697506,2.8948719831721,1.25351809893775,0.271224925600922,-0.67685183075534,0.0879991986133098,0.518704500241918,0.592698987439292,2.63964829756068,-0.14176202221528,0.077745970293495,-0.638641250561509,0.836506016635764,0.107149624155406,-1.49000262475019,2.97226268482977,-1.9180271490669,-1.17740821829981,-0.0289823591601381,0.0401132425114254,0.0421879948146553,-0.0950869536297786,0.403070177813424,0.345432021134951,-0.64716536923915,-0.302635325930159,-0.562269220561461,-1.3420259621234,-1.56036792805585,-0.340831325623901,-3.55483693114026,-0.528642174006532,-0.583358400319553,-0.975663945116622,0.23791961727965,0.77269495321991,-0.259046620790837,-1.50762197251385,-3.06400469824943,0.850375190189461,0.790040832262816,-0.304409957886928,-0.350936014788153,-3.29897992716342,-1.31375725425002,1.03684956122087,1.39600472433571,-1.41647630412127,0.813596382686032,1.50949067451351,0.975247611980585,-0.485649699269625,-0.100545381962068,-0.346970797628606,0.148396049179851,1.54696403427131,1.13260205264322,-0.368200007651646,0.567168224046432,1.37421267816224,-0.620625287525359,-0.0641606177322905,-3.40977312601429,-0.320632155874013,0.618769728072113,0.369780686167431,0.0299265475184135,-0.132961122180317,-1.55059353388897,2.16762568126438,-1.08984250430436,0.44117899903265,-0.60690502297341,1.40601847977881,-0.0558648905947088,-0.155108416515081,2.36337590724161,-1.04103889689091,0.518059110679652,-0.695956545890832,0.0521376318985789,0.078116683144321,0.18629231025261,0.136642009890951,-2.22417577072571,-0.56978519840542,-1.47862863234562,-1.65264419023098,-0.8884701947567,2.49032523875552,-0.787184333678784,-0.567196352749248,0.151001039406995,1.06044750006966,1.34172191885211,0.2588174436987,1.0998886549639,-0.458400224073027,-0.583386808998892,0.214444908207888,0.242067877021668,-0.515945210047121,-0.779769540700784,-0.0347285120027766,-1.21380951286178,-1.48228452967532,-0.394551402180593,0.572217082140371,-0.544042447345757,0.399599158381443,-0.628183606916963,0.903702835061962,-0.134484410832157,1.03242728272474,-0.768700684751002,-0.629645458780947,0.0661797471286705,-2.96165867766026,0.538057489997317,-0.120743262366634,3.22485622034421,-0.530848456369357,-2.81155845483269,1.2001966392275,-1.05255960174706,-1.96219573714475,-0.44712103410027,2.37738723496202,0.299675878128054,1.27289094783724,1.94238717011904,0.604134779264851,-3.11259160754009,-0.978260739854904,-0.266851782742031,1.35387865414653,0.445517650101181,2.99378355501725,1.67702282264461,-0.279217091305749,-1.21278775094862,0.412360910980309,2.65998220619301,-0.245246418752923,-0.59204123329488,1.41205638014682,0.118759521738216,-1.19375282529686,0.0216178333315382,-0.444594508120712,-0.500284747038051,-0.645894703827276,0.457488802621755,2.46759199002379,0.483788900445484,-0.435301033689639,-0.0651575801508174,-1.41801815909067,-0.549590905473963,2.25187894304127,0.991251104919139,1.1060397999793,2.25097197891882,-0.531347358778224,-0.520746947604085,0.683339221165493,1.35171860738546,-1.37721603013961,-1.61962244811287,-0.832024683457384,0.431973079138501,-0.437906979046975,-1.62958135646003,0.0510848331158283,-0.577690917899283,0.566155182246301,-0.905786797366523,1.04921456606211,0.91691569799784,-0.757213444627259,-0.814790441427979,-1.97652969899719,0.583888564220603,1.6818609861465,0.828806308280536,1.04175765571787,0.931521490646662,-1.46678069334414,0.174054437451666,1.3160079324747,0.124604027614805,0.0681151679145779,1.27599094125239,1.30126138254228,-1.33115393552319,1.5089603947676,0.574734409280235,0.0287278476774065,0.470136848697146,-0.959566860473827,0.113033677025294,0.478689742401934,-1.32387701016881,-0.120579975717441,-0.31761972547059,1.31655169026713,-1.48776194332388,-0.135602513032693,1.26341204018124,-0.0809752327792659,-1.48813071198135,0.745879780392878,-0.678250048776581,0.134423865962061,-1.36300157798071,2.38883673615715,0.864837256929574,0.0639080970925194,0.1731139099063,0.915148184038715,0.169290730167617,0.170763639458406,-3.08582373973908,0.133247857509048,-0.959934385629038,-0.22128458660508,-1.52082875458356,-1.34990742443717,-1.20255119397862,0.614831253687175,2.64282223604258,-0.265898085694732,-0.191667964003606,-0.618164259357733,2.01679274471469,-1.41924115894895,-0.784810027149217,2.49643494262723,-1.35782126849696,0.194320643701129,-2.04910239217287,-0.655932924671176,-0.870932738200651,-1.42693373521706,0.471464032265817,0.385039609189856,-0.681466708629906,0.197709014272482,-1.29428723223246,-0.503830963631595,-1.30886139380752,0.0347185653102332,1.36556894228294,1.79030779134942,-0.864153152632234,1.54825064708573,0.316645962656802,0.215383999354583,2.04484281878219,-3.65552096947324,2.44060686826433,2.58528360984114,0.667466438070569,-0.235854690289397,-0.545636023000749,-0.01762555240459,0.137295685413293,-0.784408403560482,1.14377535205525,0.754401605184899,1.09611115010977,1.70582377616065,-1.29444186457983,-2.92129637512291,-0.376325857821407,-0.23081167408616,-0.0286684601134749,1.07393427846073],"z":[0.173507958930287,-2.67830125388609,-0.220465258333285,-0.645327519019081,-0.57005831289299,0.616459896554878,-0.780741438285653,-0.491530500808348,0.405861833351719,-0.118318332391146,1.14456076431397,-0.352113335850177,0.395581541999237,1.9614299409078,-0.896018508310778,-0.968447580260685,-0.735815746102854,0.440081721027855,-2.19044315007205,0.267032309685373,0.625844373213993,1.47485354908811,-2.41714527870799,0.200591413016203,-1.53826730966923,2.1869371353426,-0.0553184856077445,0.272016082888696,-0.285543253636539,0.974078545887799,1.36006442163347,0.248761759238037,1.83492736498382,1.88124598226793,0.50283358078778,-2.2322862486621,0.361109024686378,-2.52487852839761,0.814362984920942,0.967377397144998,1.99609842941219,0.739393246493896,0.226620305388718,-1.87859873368566,0.492691841349714,0.912820904053374,0.547068772332324,0.7463435826331,0.77256793551244,0.303492504960537,1.40641478688249,0.505041755971474,0.71010392171972,0.475840912745016,0.665700635419208,0.11387958225602,0.214426665835588,-0.879843097367963,0.50178051376327,0.586734013041864,-0.108742722440492,-0.300500487651191,0.170230269015476,0.31652785013154,0.191940045053133,1.11564320007672,-0.45740239013973,-2.31288190650569,-0.287147916547337,0.818010116778938,-0.601220515464608,-0.149867467615467,-2.56738439668857,-2.38607093658286,0.373382897645395,-3.13732963953692,-1.67433383552435,1.57710233882142,0.5275833249839,0.32342968372622,-0.0288044219843346,0.199980435264632,-2.54765039429646,-0.725937853133073,1.00697264125383,-1.75070065357483,-0.679840520363655,-1.34886692459286,2.12656773566896,-0.044570681200736,0.527450638446342,-2.60592022014368,0.12639169226497,-3.39928856034585,-0.609449274097644,-0.872357272193804,0.306525724505042,0.188329811411534,-0.69597604223603,1.34941173263846,-1.17612327353257,0.818260047824363,-1.21727272616684,-0.114573363659388,0.104737929362167,0.176254734386535,-0.143427442897162,-0.0967688322157606,1.05749836429055,-0.842202892103678,-0.16015517422689,0.17537776251999,-0.262538573168348,1.80394127634439,-0.432301705524405,0.376364522787893,-1.61239967546788,0.130377243517435,0.285224035576357,1.27922644908295,1.63339175345795,-0.232258115918122,1.27042596728434,0.710710711647655,1.87127095407737,-1.22681733420988,-0.82233733903721,-0.449844270866508,0.0865813449042116,-0.0569455450359458,1.00343619286468,1.79849771973382,1.3481891356414,1.52704910395582,-1.09109518626277,-1.14282389238233,-0.279723703318135,0.261402174070273,-0.85424973509364,-0.0721220214656424,1.67303517582267,-2.28838594223054,0.76807921814998,0.711020594340852,-0.073845339368248,0.348852563975344,0.790192739103407,0.637221061932761,2.017267569797,0.455555284395819,0.910379794724714,-0.609365983592446,-0.145354285597069,2.71784417258831,0.119016654491994,-0.959596841177821,-2.08774948642519,0.192047008938649,-0.740455225861264,0.493125463750977,0.63665196141287,0.120388770609145,-0.105287451590222,0.0496479919457366,-1.98191153785947,1.64816376726816,0.775354730702775,0.0764897630022898,0.764322846738228,-1.69205891152528,-1.96935751757094,0.530059937994777,-0.156437526309414,0.380518798755228,0.0978778304784699,-0.141064747991833,1.36215636965808,-0.036695102225893,-0.228388650863833,-0.121711782565479,0.185928188304308,-1.57984008463282,1.41609125240119,-0.134043122534093,0.975708563373498,1.39526695302502,0.0475575976816548,-1.34482889496996,-1.71320446214249,-1.31322642883527,0.0779912592183085,-0.360225559824714,0.414110260765194,0.702427263641774,0.177659789892266,-1.11592659909136,-0.804424826063794,-0.0833931617270097,0.611937480843653,1.05801234965139,-0.73815639612998,0.388238221322894,-0.473047849805015,0.199759341071406,-0.141378703552411,0.159875089234845,-0.885251787580986,0.321666883406585,-0.495956319357246,-0.675384518683543,-0.589034582542512,-0.157305035467972,0.165786892145642,0.314701480340146,0.0243771859330854,-2.66443508959459,-0.497691199364828,-0.392068674893234,0.379228303785058,0.739568100736119,-0.403325437693944,0.0930642007896773,0.524581451555445,1.4374640559592,0.564589765599985,0.435478736418127,0.785881386714411,-0.93721553363262,0.545751291311456,0.282239787848517,-0.801531259477206,0.553596590326227,0.741598138586343,1.50077506657354,-0.0956502307652168,1.729303269912,-1.90944343154616,0.188829047315349,0.517800163625527,0.52934637014206,0.388581897980018,0.909198996491827,-0.366244799338639,-2.84109484967655,0.819426902753836,-0.0950599520200208,-1.08788805002573,-0.983235371956161,-0.38311943851345,0.700724401390981,0.727364321838376,0.654880334792198,0.436197058725477,0.0897586304627224,-0.131653049055238,0.414914626896442,-1.05463443185915,0.13147451081421,-2.49644845900557,0.609073744121918,2.55419245082415,0.331067360290696,0.185651676731684,-1.99100767375327,0.267780465981547,0.522090539848162,-0.96380426393817,1.22393746498114,-0.542138510925835,-0.51072563065901,0.557617887088536,0.74773465046291,2.20113962017105,-2.27876144829735,0.256838794672639,0.504178013274367,-0.0467892643493847,-0.446174164479694,-0.0801122051816521,-2.58268308530692,0.669149845668088,-0.171437554636399,-0.619148983371202,0.669818091552176,0.386148670537737,-2.58069168356235,-0.350226597552452,-1.39804681453274,-0.111689292396388,-0.776189083446346,-0.671978100391584,0.859095980692473,0.0236570998043008,0.512213633960729,0.329342054403274,0.323933378518671,-0.833400878725531,0.406935102557054,2.91386107132016,0.271350294736458,0.326921158623243,1.77947116470242,0.0389592224841209,0.606989635913978,-0.906171994143057,0.394746954860111,-1.12870161303502,-0.18015575404526,1.2969049568535,0.252629948150208,1.58341556709588,0.293101390377901,-0.141371767341595,0.400382037165432,0.866112547200893,-0.335672756755833,-0.791621925387844,-0.175172879518281,1.30226501834707,0.345279162085749,2.33705201857648,0.433919282224716,0.363153425769134,-0.252916318742727,0.303180426837829,1.08678576850982,-0.442298451861169,-0.284631499584069,-0.895352328137867,-0.862633994716404,-0.13527140686637,1.25891099977589,0.264509694118803,1.1384045642256,-1.10243344389477,-0.781454035630539,0.0562635747451358,0.590307003985251,-0.0655934899482685,-0.435038777836703,-2.06314136283494,-1.11132926739008,0.607318503153174,0.428357654599608,0.103425324076133,-1.73438082656397,1.13733340595552,-0.674145553320738,-2.46922149197885,-0.188169674457099,0.677416644500756,0.128683858047499,0.60110965316707,0.534132625344637,1.51102828882541,2.10293059107999,-0.733414059708051,-0.719816536191311,-0.998057717865307,0.0339689312696631,-0.855968238971975,2.31661345095434,1.04134240646336,-2.30003790859397,1.6918862738574,-0.264800720310152,1.19326485106875,0.103163926002359,0.206090284927721,-2.45231321182487,0.0379549522248451,1.37299997090052,-0.476057615307119,0.741978867550185,0.872551713846899,0.867737142210359,0.943319973823917,1.76918116412115,-0.40890518981598,0.285087431850877,0.0321258620935078,-0.109993937693549,0.280002950690133,0.348606194383015,-0.82491083333247,-1.41529952934896,-1.73261930543094,-1.58450369797042,1.01963326569246,-0.47611923332658,0.107488149644325,-0.0601307631117023,-0.234906379990687,0.236949580528614,0.189162694757061,-0.276837508578492,0.676174061826858,-2.76303724606631,1.30011096481746,-0.39351019465626,-0.0987765517316302,-0.684213155511586,0.511344946703657,-0.245890598027562],"mode":"markers","type":"scatter3d","marker":{"color":"rgba(31,119,180,1)","line":{"color":"rgba(31,119,180,1)"}},"error_y":{"color":"rgba(31,119,180,1)"},"error_x":{"color":"rgba(31,119,180,1)"},"line":{"color":"rgba(31,119,180,1)"},"frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.2,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script> |
| Interpret results |
| Unsure, why it is not working. Any suggestions? |
| ## 2. Data wrangling (max 5 points) This code is for the next week’s data! |
| See create_human.R and human.csv |
| End of assignment 4 |
| *** |
r date() |
## [1] "Mon Dec 12 11:04:32 2022" |
| # Assignment 5: Tasks and Instructions |
| ## 1. Data wrangling (max 5 points) |
| See create_human.R and human_final.csv at data folder. |
| ## 2. Analysis (max 15 points) |
| First we install/use R packages we need to complete the assignment. |
r #install.packages('ggplot2') library(ggplot2) library(vtable) library(MASS) library(tidyr) library(dplyr) library(corrplot) library(GGally) library(plotly) library(tidyr) |
| ### 1. Graphical overview and summaries of the data. Describe and interpret the distributions and the relationships. (0-3 points) |
r human_ch5<- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep=",", header=T) View(human_ch5) #human chapter 5 |
r st(human_ch5) |
| The human2 has 155 observations (rows, participants) and data has 8 variables (columns): |
| - Edu2.FM: Ratio of female and male populations with secondary education - Labo.FM: Ratio of labor force participation of females and males - Edu.Exp: Expected Years of Education - Life.Exp: Life Expectancy at Birth - GNI: Gross National Income (GNI) per Capita - Mat.Mor: Maternal Mortality Ratio - Ado.Birth: Adolescent Birth Rate - Parli.F: Percent Representation in Parliament (of females I assume) |
r gg = ggpairs(human_ch5, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size=3)), title="Graphical overview of human_ch5 data") gg1 = gg + theme(axis.text = element_text(size=5), strip.text.x = element_text(size = 5), strip.text.y = element_text(size=5)) gg1 |
| Distributions: |
| - Normally distributed: Edu2.FM, Labo.FM, Edu.Exp, Parl.F. Labo.FM is a bit skewed on right and Parl.F a bit on left. Depending how the ratio (Labo.FM) has been defined there are either more males or females in labour force participation. Based on the common knowledge about gender difference in labour I assume there are more males. Simialrly, if the Parli.F is the ratio of females in parliment, there are less females than males in parliaments. - Exponential distribution: GNI, Mat.Mor and Ado.Birth have more like exponential distributions. |
| Correlations: / Most correlations were positive and based on the scatterplots they seem to have linear relations (except some variables with exponential distributions) |
| - Highest correlation: Higest correlation (-.857) was detected between Live.Exp and Mat.Mor. This means that the countries that have high life expectancy at birth have low maternal mortality ratio, and vice versa. This makes sense, since it could indicate the health care system status in each country. - Edu2.FM: seem to have moderate positive correlation with Edu.Exp, Life.Exp and GNI, along with moderate negative with Mat.Mor and Ado.Birth. No correlation between Labo.FM and Parli.F. - Labor.FM: has only small positive correlation between Mat.Mor and Parli.F - Edu.Exp: has moderate positive correlation with Life.Exp and GNI, and weak Parli.F, along with moderate negative correlation with Mat.Mor and Ado.Birth. - Life.Exp: has strong and weak positive correlation between GNI and Parli.F, respectively.As well as, strong negative correlation between Mat.Mor and Ado.Birth - GNI: moderate negative correlation with Mat.Mor and Ado.Birth - Mat.Mor: strong positive correlation with Ado.Birth |
| Overall, Maternal mortality ratio and adolescent birth rate seem to most likely have negative associations to other variables. Meaning when mortality or adolescence birth rate is high it is associated with lower life expectancy, female ratio of 2nd education, a total of years in education (in country), and GNI for each country, and vice versa. Similarly, GNI, Edu2.FM, Edu.Exp, Life.Exp are positively associated with each other. In countries where people live longer, spend more time in school, have higher/more event female/male 2nd education ratio have often high GNI. However, ratio of labor force between female and male, as well as the female representative in parliament does not seem to have that great impact on other variables. |
| ### 2. Principal component analysis (PCA) on the raw (non-standardized) human data. (0-2 points) |
| Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. |
| Interpret the results of both analysis (with and without standardizing). |
| - Are the results different? - Why or why not? - Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. - Which number affects what? |
r pca_human_ch5 <- prcomp(human_ch5) # perform principal component analysis, PCA (with the SVD method) summary(human_ch5) |
## Edu2.FM Labo.FM Edu.Exp Life.Exp ## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 ## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 ## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 ## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 ## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 ## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 ## GNI Mat.Mor Ado.Birth Parli.F ## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00 ## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40 ## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30 ## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91 ## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95 ## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50 |
| This is the “raw” un-standardized data. You see that the range in values between the variables differ a lot (min-max), so it is very hard to draw any conclusions or interpreted the findings. |
r s2 <- summary(pca_human_ch5) pca_pr2 <- round(100*s2$importance[2, ], digits =1) # rounded percentanges of variance captured by each PC print(pca_pr2) # print out the percentages of variance |
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 ## 100 0 0 0 0 0 0 0 |
r # create object pc_lab to be used as axis labels pc_lab2 = paste0(names(pca_pr2), " (", pca_pr2, "%)") pc_lab2 |
## [1] "PC1 (100%)" "PC2 (0%)" "PC3 (0%)" "PC4 (0%)" "PC5 (0%)" ## [6] "PC6 (0%)" "PC7 (0%)" "PC8 (0%)" |
r biplot(pca_human_ch5, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2]) |
| ``` ## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = ## arrow.len): zero-length arrow is of indeterminate angle and so skipped |
| ## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = ## arrow.len): zero-length arrow is of indeterminate angle and so skipped |
| ## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = ## arrow.len): zero-length arrow is of indeterminate angle and so skipped |
| ## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = ## arrow.len): zero-length arrow is of indeterminate angle and so skipped |
| ## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = ## arrow.len): zero-length arrow is of indeterminate angle and so skipped ``` |
| ### 3. Standardize human data and repeat the above analysis. (0-4 points) |
r human_ch5_std <- scale(human_ch5) # standardize the data summary(human_ch5_std) # summaries |
## Edu2.FM Labo.FM Edu.Exp Life.Exp ## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188 ## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425 ## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056 ## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 ## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717 ## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218 ## GNI Mat.Mor Ado.Birth Parli.F ## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203 ## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409 ## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403 ## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 ## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127 ## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850 |
| When the data is standardized (M=0, SD=1) the comparison between the variables becomes easier. You can still see that there is huge difference between the countries (max-min) in GNI and Maternal Mortality Ratio (Mat.Mor). |
r pca_human_ch5_std <- prcomp(human_ch5_std) s3 <- summary(pca_human_ch5_std) pca_pr2_std <- round(100*s3$importance[2, ], digits =1) # rounded percentanges of variance captured by each PC print(pca_pr2_std) # print out the percentages of variance |
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 ## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3 |
| ```r pc_lab2_std = paste0(names(pca_pr2_std), ” (“, pca_pr2_std,”%)“) |
| #PCA biplot(pca_human_ch5_std, cex = c(0.8, 1), col = c(“grey40”, “deeppink2”), xlab = pc_lab2_std[1], ylab = pc_lab2_std[2]) ``` |
| ### 4. Interpretations of the first two principal component dimensions PCA standardized human data (biplot). (0-2 points) |
| Interpret the results of both analysis (with and without standardizing). |
| - Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. - Are the results different? - Why or why not? - Which number affects what? |
| Instructions based on the exercise: |
| A biplot is a way of visualizing the connections between two representations of the same data. First, a simple scatter plot is drawn where the observations are represented by two principal components (PC’s). Then, arrows are drawn to visualize the connections between the original variables and the PC’s. The following connections hold: |
| - The angle between the arrows can be interpreted as the correlation between the variables. - The angle between a variable and a PC axis can be interpreted as the correlation between the two. - The length of the arrows are proportional to the standard deviations of the variables. |
| Other resources: |
| I also found another link BioTuring which I found quite useful: |
| - Bottom axis: PC1 score - Left axis: PC2 score - Top axis: loadings on PC1 - Right axis: loadings on PC2 |
| - vectors that are close to each other are positively correlated - 90 degree = no correlation - 180 degree = negative correlation. |
| Biplot (unstandardized) |
| Biplot is used to visualize associations between variables. The country names are located based on the two principal components (PC1 and PC2), the arrows indicate how the different variables (e.g., GNI, Life expectancy etc., are related to the components). |
| Since, the data is unstandardized the values (min-max) of the variables determine the connection. The length of the arrow indicates the standard deviations of the variable. Since, GNI had a range between 581 (Central African Republic) and 123124 (Qatar), which is far bigger than any other variable (hence biggest standard deviations), it overpowers the biplot. Country with highest GNI is Qatar, followed by Kuwait, and Singapore are closest to GNI (x-axis). Countries with lowest GNIs (Central African Republic, Congo, Malawi etc.,) are located furthest from GNI (x-axis). We need to use standardized data, so we can draw better conclusions of how these variables are associated to this phenomena. |
| Biplot (unstandardized) |
| When the data is standardized you can compare the variables better and those are spread around the biplot, unlike in unstandardized version where GNI was the only one, due to its large values and range between the countries. |
| Here you can see that there 3 different clusters (but 2 components) that are affecting our results: |
| - Education: entiles variables related to Edu.Exp, GNI, Edu2.FM, and Life.Exp (PC2) - Birth: entiles Mat.Mor and Ado.Birth (PC2) - Labour: entiles Parli.F and Labo.FM (PC1) |
| - Both Education and Birth components are part of the PC2, since they are close to PC2 origin (0). - Education components (Edu.Exp, Edu2.FM, GNI and Life.Exp) are positively correlated with each other - Mat.Mor and Ado.Birth are positively correlated with each other -The two clusters (Education and Birth) are negatively correlated (180 degree angle) |
| In other words, countries that have high education (in years, and female ratio), life expectancy and GNI, often have low maternal mortality and adolescents pregnancy rates. Countries with high Education-components and low Birth-components are presented left side. For example, Iceland and Norway (top left), Korea and Japan (mid-left), and Qatar and Iran (bottom-left). And vice versa on (right side). |
| - Parli.F and Labo.FM are positively correlated with each other (close to each other) - Both are related to PC1, since they are located close to x-axis origin (0). - Countries that have high (or more even) rates of females in their parliaments and labor force are located on top (e.g., Rwanda, Mozambique, Iceland, Bolivia, Burundi etc.), and countries with low female rate at the bottom (e.g., Iran, Yemen, Egypt etc.) - Labor cluster does not seem to correlate with the other component (PC2: Education and birth), since there is 90 degrees between the variables (arrows). |
| However, based on the arrow lengths there is high variability between the countries for each variable (longer the arrow, higher standard deviation). |
| The proportion of variance in this model is explained 53.6% based on the PC1 (Labor) and 16.2% based on PC2 (Education and Birth) |
| To summarize, countries with high Education, low Birth and high Labor components are located left top corner. |
| ### 5. Tea data (0-4 points) |
| Tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions). |
| Load the tea dataset and convert its character variables to factors: |
r tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE) |
| Explore the data briefly: |
| - look at the structure and - the dimensions of the data. - Use View(tea) to browse its contents, and - visualize the data. |
r View(tea) dim(tea) #300 observations, 36 variables |
## [1] 300 36 |
r str(tea) |
## 'data.frame': 300 obs. of 36 variables: ## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ... ## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ... ## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ... ## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ... ## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ... ## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ... ## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ... ## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ... ## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ... ## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ... ## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ... ## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ... ## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ... ## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ... ## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ... ## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ... ## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ... ## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ... ## $ age : int 39 45 47 23 48 21 37 36 40 37 ... ## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ... ## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ... ## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ... ## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ... ## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ... ## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ... ## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ... ## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ... ## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ... ## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ... ## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ... ## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ... ## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ... ## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ... ## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ... ## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ... |
| Tea data has 300 observations, 36 variables. Most seem to be factors with two levels (“dummy variables”), but there are some with 3, 4 and even 5-7 values. Age is the only numeric (integral) value. |
r library(dplyr) library(tidyr) # column names to keep in the dataset tea_when <- c("breakfast", "tea.time", "evening", "lunch", "dinner", "always", "frequency") tea_where = c("where", "home", "work", "tearoom", "friends", "resto", "pub") tea_how = c("sugar", "Tea", "How", "how", "price") tea_who = c("age_Q", "sex", "SPC", "Sport") tea_why = c("escape.exoticism", "spirituality", "healthy", "diuretic", "friendliness", "iron.absorption", "feminine", "sophisticated", "slimming", "exciting", "relaxing", "effect.on.health") |
r # When tea_when <- select(tea, tea_when) |
## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(tea_when)` instead of `tea_when` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. |
| ```r cb_pal= c(“#F0E442”, “#E69F00”, “#0072B2”, “#56B4E9”, “#F0E442”, “#E69F00”, “#56B4E9”, “#0072B2”, “#F0E442”, “#E69F00”, “#56B4E9”, “#0072B2”, “#F0E442”, “#E69F00”,“#F0E442”, “#E69F00”) |
| pivot_longer(tea_when, cols = everything()) %>% ggplot(aes(x=value, fill=value)) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ coord_flip() + geom_bar() + scale_fill_manual(values = cb_pal) + labs(title = “When people are drinking tea (dummy)”, x=“When?”) ``` |
| People tend to drink tea: |
| - at least once a day (1/day or +2(day) - around than a half were drinking during breakfast, and had specific time when to drink tea - around one in thirst would also have tea in the evening and have alwayd drank tea. - most people avoided tea drinking during lunch or dinner. |
r # Where tea_where <- select(tea, tea_where) |
## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(tea_where)` instead of `tea_where` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. |
| ```r cb_pal= c(“#F0E442”, “#E69F00”,“#F0E442”, “#56B4E9”, “#0072B2”, “#E69F00”, “#F0E442”, “#0072B2”, “#56B4E9”, “#F0E442”, “#E69F00”, “#0072B2”, “#56B4E9”, “#E69F00”, “#F0E442”) |
| pivot_longer(tea_where, cols = everything()) %>% ggplot(aes(x=value, fill=value)) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ coord_flip() + geom_bar() + scale_fill_manual(values = cb_pal) + labs(title = “Where people are drinking tea?”, x=“Where?”) ``` |
| - The most common places to drink tea were was at home - people often purchase their tea at chain store instead of from a specific tea store - approximately 25% drank tea a work - people rarely drank tea at tearooms, restaurants or pubs - most people were drinking tea with friends |
r # How tea_how <- select(tea, tea_how) |
## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(tea_how)` instead of `tea_how` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. |
| ```r cb_pal= c(“#0072B2”, “#56B4E9”, “#F0E442”, “#E69F00”, “#F0E442”, “#0072B2”, “#56B4E9”, “#0072B2”, “#56B4E9”, “#E69F00”, “#F0E442”, “#E69F00”, “#0072B2”, “#56B4E9”, “#0072B2”, “#56B4E9”,“#0072B2”, “#56B4E9”) |
| pivot_longer(tea_how, cols = everything()) %>% ggplot(aes(x=value, fill=value)) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ coord_flip() + geom_bar() + scale_fill_manual(values = cb_pal) + labs(title = “How people are drinking tea?”, x=“How?”) ``` |
| - approximately half had tea with sugar, other half without - most people were drinking early grey, then black and last green tea - most people did not add any addings to their tea, some had milk (approx 60 people), lemon (approx 25 people) or other (approx 10 people) addings. - most used tea bags, almost 100 people used teabagd and unpacked tea, and less than 50 people used only unpacked tea products. - most people bought different types of tea, or used certain specific brands. People rarely bought cheap tea, or variaety they did not know. over 50 people bought upscaled tea (whatever that means) |
r # Who tea_who <- select(tea, tea_who) |
## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(tea_who)` instead of `tea_who` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. |
| ```r cb_pal= c(“#F0E442”, “#E69F00”,“#F0E442”, “#E69F00”,“#F0E442”, “#56B4E9”, “#0072B2”, “#F0E442”, “#E69F00”, “#F0E442”, “#E69F00”,“#F0E442”, “#E69F00”, “#F0E442”, “#0072B2”, “#56B4E9”) |
| pivot_longer(tea_who, cols = everything()) %>% ggplot(aes(x=value, fill=value)) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ coord_flip() + geom_bar() + scale_fill_manual(values = cb_pal) + labs(title = “Who are are drinking tea?”, x=“Who?”) ``` |
| most common people who drank tea (or at least answered this survey) were: |
| - females - 15-24 or 25-34 years old - students, non-workers or employees - considered themselves “sportman” |
r # Why tea_why <- select(tea, tea_why) |
## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(tea_why)` instead of `tea_why` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. |
| ```r cb_pal= c(“#F0E442”, “#E69F00”, “#0072B2”, “#56B4E9”, “#F0E442”, “#E69F00”, “#56B4E9”, “#0072B2”, “#F0E442”, “#E69F00”, “#56B4E9”, “#0072B2”, “#F0E442”, “#E69F00”, “#0072B2”, “#56B4E9”, “#E69F00”, “#F0E442”, “#56B4E9”, “#0072B2”, “#E69F00”, “#F0E442”, “#56B4E9”, “#0072B2”) |
| pivot_longer(tea_why, cols = everything()) %>% ggplot(aes(x=value, fill=value)) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ coord_flip() + geom_bar() + scale_fill_manual(values = cb_pal) + labs(title = “Why people are drinking tea (dummy)?”, x=“Why?”) ``` |
| people drank tea because: |
| - Majority of people drank tea because they wanted to relax, tea is a diuretic and healthy product, or social reasons and because they think it is sophisticated manner. - rarely people drank due to its health effects, excitement, slimming purposes, absorption iron, or spiritual reasons - 50-50 though drinking tea was feminine or they wanted to escape exoticism (what ever that means). |
### Instructions - Use Multiple Correspondence Analysis
(MCA) on the tea data (or on just certain columns of the data, it is up
to you!). - Interpret the results of the MCA and - draw at least the
variable biplot of the analysis. - You can also explore
other plotting options for MCA. - Comment on the output of the plots. -
Look at the summary of the model. - Plot the variables of the model. You
can either plot the variables or the individuals or both. You can change
which one to plot with the invisible argument. |
r #install.packages("FactoMineR") library(FactoMineR) |
## Warning: package 'FactoMineR' was built under R version 4.2.2 |
| ```r #mca_tea <- MCA(tea, graph = FALSE) # multiple correspondence analysis mca_when <- MCA(tea_when, graph = FALSE) # multiple correspondence analysis mca_where <- MCA(tea_where, graph = FALSE) mca_how <- MCA(tea_how, graph = FALSE) mca_who <- MCA(tea_who, graph = FALSE) mca_why <- MCA(tea_why, graph = FALSE) |
| # summary of the model summary(mca_when) ``` |
## ## Call: ## MCA(X = tea_when, graph = FALSE) ## ## ## Eigenvalues ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 ## Variance 0.241 0.208 0.167 0.144 0.135 0.117 0.108 ## % of var. 18.738 16.185 13.015 11.216 10.516 9.118 8.407 ## Cumulative % of var. 18.738 34.922 47.937 59.153 69.669 78.787 87.194 ## Dim.8 Dim.9 ## Variance 0.094 0.071 ## % of var. 7.280 5.526 ## Cumulative % of var. 94.474 100.000 ## ## Individuals (the 10 first) ## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr ## 1 | 0.227 0.072 0.062 | -0.707 0.801 0.601 | -0.120 0.029 ## 2 | 0.227 0.072 0.062 | -0.707 0.801 0.601 | -0.120 0.029 ## 3 | 0.428 0.253 0.068 | 0.255 0.104 0.024 | 0.181 0.065 ## 4 | 1.218 2.052 0.550 | -0.369 0.219 0.051 | -0.049 0.005 ## 5 | -0.459 0.291 0.189 | 0.267 0.114 0.064 | -0.649 0.839 ## 6 | 1.218 2.052 0.550 | -0.369 0.219 0.051 | -0.049 0.005 ## 7 | -0.071 0.007 0.003 | -0.247 0.098 0.039 | 0.679 0.920 ## 8 | 0.296 0.121 0.060 | 0.632 0.640 0.274 | 0.272 0.147 ## 9 | -0.412 0.235 0.264 | -0.439 0.308 0.298 | -0.013 0.000 ## 10 | -0.317 0.139 0.110 | -0.035 0.002 0.001 | -0.182 0.066 ## cos2 ## 1 0.017 | ## 2 0.017 | ## 3 0.012 | ## 4 0.001 | ## 5 0.377 | ## 6 0.001 | ## 7 0.294 | ## 8 0.051 | ## 9 0.000 | ## 10 0.036 | ## ## Categories (the 10 first) ## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test ## breakfast | -0.599 10.200 0.331 -9.945 | -0.574 10.875 0.305 -9.544 | ## Not.breakfast | 0.553 9.416 0.331 9.945 | 0.530 10.038 0.305 9.544 | ## Not.tea time | 0.475 5.840 0.175 7.230 | 0.087 0.228 0.006 1.327 | ## tea time | -0.368 4.527 0.175 -7.230 | -0.068 0.177 0.006 -1.327 | ## evening | -0.338 2.333 0.060 -4.232 | 0.745 13.076 0.290 9.313 | ## Not.evening | 0.177 1.220 0.060 4.232 | -0.389 6.837 0.290 -9.313 | ## lunch | -1.099 10.511 0.208 -7.881 | 0.691 4.803 0.082 4.951 | ## Not.lunch | 0.189 1.807 0.208 7.881 | -0.119 0.825 0.082 -4.951 | ## dinner | 2.094 18.204 0.330 9.935 | -0.024 0.003 0.000 -0.114 | ## Not.dinner | -0.158 1.370 0.330 -9.935 | 0.002 0.000 0.000 0.114 | ## Dim.3 ctr cos2 v.test ## breakfast -0.153 0.960 0.022 -2.542 | ## Not.breakfast 0.141 0.886 0.022 2.542 | ## Not.tea time -0.472 8.318 0.173 -7.191 | ## tea time 0.366 6.448 0.173 7.191 | ## evening 0.232 1.583 0.028 2.906 | ## Not.evening -0.122 0.828 0.028 -2.906 | ## lunch 1.025 13.160 0.181 7.349 | ## Not.lunch -0.176 2.262 0.181 -7.349 | ## dinner -0.086 0.044 0.001 -0.406 | ## Not.dinner 0.006 0.003 0.001 0.406 | ## ## Categorical variables (eta2) ## Dim.1 Dim.2 Dim.3 ## breakfast | 0.331 0.305 0.022 | ## tea.time | 0.175 0.006 0.173 | ## evening | 0.060 0.290 0.028 | ## lunch | 0.208 0.082 0.181 | ## dinner | 0.330 0.000 0.001 | ## always | 0.053 0.209 0.402 | ## frequency | 0.530 0.565 0.365 | |
| Tips how to read this summary and the plot below can be found here and here |
| Explained variance |
| - More variance is explained the dewer insight is missing - The first dimension explains the variance of when people are drinking tea 19%. teh second 16% and so on. The last dimension (9) only explains 5% of the variance. - The dimension 1 (19%) and 2 (16%) already explains 35% of the total variance. NOTE. Not all variables will fit this interpretation of two-dimensionality. - All in all, when people are drinking tea is not alone good-enough to explain the differences in peoples tea habits. cos2 cateogries |
| - When the sum of cos2 is close to one, the variables are well presented. - For example, in this summary both breakfast can be explained by 2 dimensions (approx. cos2 in Dim2 is .300), - but dinner is mostly explained by the first dimension (cos2 is .330, when for dim2 and dim3 is zero). - cos2 individuals would go through how these are representing each participant who took a part in this survey. |
| Eta2 |
| - is often used to describe the effect size, unsure if that is the case here too. simplified version of explaining eta is to try ro rhink that it is a proportion of total variance explained in specific variable when it is part of certain group. For exampe, dinner is mainly associated with dimension1. |
r # visualize MCA plot(mca_when, invisible=c("ind"), graph.type = "classic", habillage = "quali") |
| - The variables categories with similar profiles are grouped together. - The further the labels are from the origin he more discriminating they are - Negatively correlated variables are opposite sides of the plot origin. - The distance between category points and the origin measures the quality of the variable category on the factor map. Category points that are away from the origin are well represented on the factor map. |
| For example, |
| - people who have tea more than twice a day, often would drink the during lunch and evening, and have specific teatimes (top-left). - People who drank only once a day, would most likely to drink in the evening (bottom-right), and they have not always drank tea. - there is also people who only have tea at breakfast (bottom-left) - and people who are occasional tea-drinkers (top-right) |
| Also, |
| - drinking tea once a day or 1 to 2 times a week are the most discriminating factors (furthest from the origin). - usually the opposites are negatively correlated, which makes sense in dummy variables (opposite sites of the origin, eg., breakfast and no.breakfast) - not dinner is not very good discriminat variable (most people dont drink tea during dinner). Unlike, people who drink tea during dinner, seem to be very distinct group from the rest (away from the origin). |
r summary(mca_where) |
## ## Call: ## MCA(X = tea_where, graph = FALSE) ## ## ## Eigenvalues ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 ## Variance 0.264 0.163 0.148 0.132 0.123 0.118 0.102 ## % of var. 23.080 14.269 12.931 11.550 10.777 10.323 8.949 ## Cumulative % of var. 23.080 37.348 50.279 61.830 72.606 82.930 91.879 ## Dim.8 ## Variance 0.093 ## % of var. 8.121 ## Cumulative % of var. 100.000 ## ## Individuals (the 10 first) ## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ## 1 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 2 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 3 | 0.285 0.103 0.083 | 0.567 0.657 0.327 | -0.066 ## 4 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 5 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 6 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 7 | -0.313 0.124 0.286 | -0.029 0.002 0.003 | -0.153 ## 8 | -0.639 0.516 0.763 | 0.023 0.001 0.001 | -0.178 ## 9 | -0.243 0.075 0.068 | -0.276 0.155 0.088 | -0.203 ## 10 | 0.222 0.062 0.035 | -0.418 0.356 0.122 | 0.021 ## ctr cos2 ## 1 0.072 0.059 | ## 2 0.072 0.059 | ## 3 0.010 0.004 | ## 4 0.072 0.059 | ## 5 0.072 0.059 | ## 6 0.072 0.059 | ## 7 0.053 0.068 | ## 8 0.072 0.059 | ## 9 0.093 0.048 | ## 10 0.001 0.000 | ## ## Categories (the 10 first) ## Dim.1 ctr cos2 v.test Dim.2 ctr ## chain store | -0.376 4.898 0.251 -8.667 | 0.424 10.101 ## chain store+tea shop | 1.049 15.487 0.386 10.749 | -0.419 4.007 ## tea shop | -0.321 0.557 0.011 -1.849 | -1.626 23.163 ## home | 0.007 0.002 0.002 0.673 | -0.110 1.032 ## Not.home | -0.221 0.080 0.002 -0.673 | 3.563 33.357 ## Not.work | -0.250 2.408 0.153 -6.771 | -0.208 2.687 ## work | 0.613 5.897 0.153 6.771 | 0.509 6.578 ## Not.tearoom | -0.323 4.566 0.436 -11.418 | 0.077 0.424 ## tearoom | 1.349 19.050 0.436 11.418 | -0.323 1.771 ## friends | 0.406 5.845 0.311 9.648 | -0.051 0.150 ## cos2 v.test Dim.3 ctr cos2 v.test ## chain store 0.320 9.786 | -0.216 2.880 0.083 -4.974 | ## chain store+tea shop 0.062 -4.299 | -0.281 1.990 0.028 -2.884 | ## tea shop 0.294 -9.372 | 2.112 43.136 0.496 12.176 | ## home 0.393 -10.834 | -0.106 1.059 0.365 -10.449 | ## Not.home 0.393 10.834 | 3.436 34.240 0.365 10.449 | ## Not.work 0.106 -5.623 | -0.167 1.914 0.068 -4.519 | ## work 0.106 5.623 | 0.409 4.687 0.068 4.519 | ## Not.tearoom 0.025 2.737 | -0.116 1.053 0.056 -4.105 | ## tearoom 0.025 -2.737 | 0.485 4.394 0.056 4.105 | ## friends 0.005 -1.214 | 0.024 0.035 0.001 0.563 | ## ## Categorical variables (eta2) ## Dim.1 Dim.2 Dim.3 ## where | 0.387 0.425 0.497 | ## home | 0.002 0.393 0.365 | ## work | 0.153 0.106 0.068 | ## tearoom | 0.436 0.025 0.056 | ## friends | 0.311 0.005 0.001 | ## resto | 0.322 0.182 0.023 | ## pub | 0.236 0.006 0.024 | |
r summary(mca_how) |
## ## Call: ## MCA(X = tea_how, graph = FALSE) ## ## ## Eigenvalues ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 ## Variance 0.325 0.259 0.247 0.239 0.208 0.205 0.200 ## % of var. 12.498 9.973 9.506 9.187 8.002 7.878 7.698 ## Cumulative % of var. 12.498 22.470 31.976 41.163 49.166 57.044 64.742 ## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 ## Variance 0.190 0.175 0.165 0.147 0.130 0.109 ## % of var. 7.293 6.715 6.361 5.660 5.017 4.211 ## Cumulative % of var. 72.036 78.750 85.111 90.771 95.789 100.000 ## ## Individuals (the 10 first) ## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2 ## 1 | -0.472 0.229 0.038 | 0.804 0.831 0.110 | 0.985 1.309 0.165 ## 2 | 0.101 0.010 0.005 | 0.043 0.002 0.001 | 0.886 1.059 0.385 ## 3 | -0.141 0.020 0.022 | -0.466 0.279 0.243 | -0.058 0.005 0.004 ## 4 | -0.446 0.204 0.216 | -0.328 0.138 0.117 | -0.160 0.034 0.028 ## 5 | -0.141 0.020 0.022 | -0.466 0.279 0.243 | -0.058 0.005 0.004 ## 6 | -0.341 0.119 0.036 | 0.238 0.073 0.018 | -0.193 0.050 0.012 ## 7 | -0.141 0.020 0.022 | -0.466 0.279 0.243 | -0.058 0.005 0.004 ## 8 | 0.101 0.010 0.005 | 0.043 0.002 0.001 | 0.886 1.059 0.385 ## 9 | 0.492 0.249 0.100 | -0.034 0.001 0.000 | 0.300 0.121 0.037 ## 10 | 1.115 1.275 0.546 | -0.076 0.007 0.003 | 0.247 0.083 0.027 ## ## 1 | ## 2 | ## 3 | ## 4 | ## 5 | ## 6 | ## 7 | ## 8 | ## 9 | ## 10 | ## ## Categories (the 10 first) ## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test ## No.sugar | 0.420 5.602 0.188 7.504 | -0.170 1.151 0.031 -3.039 | ## sugar | -0.449 5.989 0.188 -7.504 | 0.182 1.231 0.031 3.039 | ## black | 0.842 10.770 0.232 8.334 | 0.388 2.859 0.049 3.836 | ## Earl Grey | -0.389 6.003 0.273 -9.042 | -0.206 2.115 0.077 -4.795 | ## green | 0.388 1.021 0.019 2.361 | 0.338 0.971 0.014 2.056 | ## alone | 0.099 0.391 0.018 2.328 | -0.148 1.093 0.040 -3.479 | ## lemon | 0.029 0.006 0.000 0.175 | 0.174 0.257 0.004 1.057 | ## milk | -0.443 2.541 0.052 -3.953 | 0.554 4.965 0.081 4.936 | ## other | 0.857 1.357 0.023 2.607 | -1.314 3.996 0.053 -3.996 | ## tea bag | -0.539 10.151 0.381 -10.667 | 0.290 3.678 0.110 5.736 | ## Dim.3 ctr cos2 v.test ## No.sugar 0.122 0.624 0.016 2.184 | ## sugar -0.131 0.667 0.016 -2.184 | ## black 0.969 18.736 0.307 9.586 | ## Earl Grey -0.139 1.012 0.035 -3.238 | ## green -1.357 16.393 0.228 -8.250 | ## alone -0.371 7.240 0.256 -8.742 | ## lemon -0.214 0.407 0.006 -1.301 | ## milk 0.867 12.778 0.200 7.731 | ## other 2.753 18.395 0.234 8.371 | ## tea bag 0.076 0.265 0.008 1.504 | ## ## Categorical variables (eta2) ## Dim.1 Dim.2 Dim.3 ## sugar | 0.188 0.031 0.016 | ## Tea | 0.289 0.077 0.447 | ## How | 0.070 0.134 0.480 | ## how | 0.521 0.447 0.041 | ## price | 0.556 0.608 0.253 | |
r summary(mca_who) |
## ## Call: ## MCA(X = tea_who, graph = FALSE) ## ## ## Eigenvalues ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 ## Variance 0.460 0.436 0.318 0.288 0.279 0.255 0.248 ## % of var. 15.322 14.548 10.600 9.585 9.297 8.493 8.279 ## Cumulative % of var. 15.322 29.871 40.471 50.056 59.353 67.846 76.126 ## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 ## Variance 0.223 0.182 0.156 0.084 0.072 ## % of var. 7.432 6.059 5.209 2.791 2.384 ## Cumulative % of var. 83.557 89.616 94.825 97.616 100.000 ## ## Individuals (the 10 first) ## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2 ## 1 | 0.021 0.000 0.000 | 0.927 0.656 0.227 | -0.218 0.050 0.013 ## 2 | -0.033 0.001 0.000 | 0.488 0.182 0.081 | -0.577 0.349 0.113 ## 3 | -0.015 0.000 0.000 | 0.466 0.166 0.045 | -0.773 0.627 0.124 ## 4 | -0.564 0.231 0.150 | -0.553 0.233 0.144 | 0.173 0.031 0.014 ## 5 | -0.147 0.016 0.008 | 0.612 0.286 0.148 | 0.013 0.000 0.000 ## 6 | -0.939 0.640 0.459 | -0.455 0.158 0.108 | 0.523 0.287 0.143 ## 7 | 0.062 0.003 0.001 | 1.072 0.878 0.284 | 0.125 0.016 0.004 ## 8 | 0.019 0.000 0.000 | 0.553 0.234 0.085 | -0.797 0.667 0.177 ## 9 | 0.062 0.003 0.001 | 1.072 0.878 0.284 | 0.125 0.016 0.004 ## 10 | 0.438 0.139 0.045 | 0.975 0.726 0.224 | -0.226 0.053 0.012 ## ## 1 | ## 2 | ## 3 | ## 4 | ## 5 | ## 6 | ## 7 | ## 8 | ## 9 | ## 10 | ## ## Categories (the 10 first) ## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test ## +60 | 1.913 25.210 0.531 12.598 | -1.103 8.832 0.177 -7.265 ## 15-24 | -0.960 15.370 0.408 -11.040 | -0.901 14.258 0.359 -10.361 ## 25-34 | -0.244 0.746 0.018 -2.307 | 0.744 7.301 0.166 7.035 ## 35-44 | 0.407 1.201 0.025 2.761 | 0.831 5.273 0.106 5.636 ## 45-59 | 0.265 0.779 0.018 2.318 | 0.659 5.060 0.111 5.758 ## F | -0.003 0.000 0.000 -0.054 | -0.401 5.472 0.235 -8.381 ## M | 0.004 0.000 0.000 0.054 | 0.585 7.983 0.235 8.381 ## employee | -0.256 0.700 0.016 -2.189 | 0.269 0.816 0.018 2.303 ## middle | 0.058 0.024 0.001 0.393 | 0.929 6.597 0.133 6.303 ## non-worker | 1.442 24.125 0.564 12.985 | -0.857 8.967 0.199 -7.714 ## Dim.3 ctr cos2 v.test ## +60 | 0.876 7.634 0.111 5.766 | ## 15-24 | -0.013 0.004 0.000 -0.148 | ## 25-34 | 0.755 10.318 0.170 7.139 | ## 35-44 | -1.134 13.470 0.198 -7.688 | ## 45-59 | -0.637 6.489 0.104 -5.566 | ## F | -0.531 13.167 0.412 -11.097 | ## M | 0.775 19.210 0.412 11.097 | ## employee | -0.428 2.829 0.045 -3.659 | ## middle | -0.453 2.151 0.032 -3.072 | ## non-worker | 0.462 3.581 0.058 4.161 | ## ## Categorical variables (eta2) ## Dim.1 Dim.2 Dim.3 ## age_Q | 0.796 0.711 0.482 | ## sex | 0.000 0.235 0.412 | ## SPC | 0.793 0.784 0.227 | ## Sport | 0.249 0.016 0.151 | |
r summary(mca_why) |
## ## Call: ## MCA(X = tea_why, graph = FALSE) ## ## ## Eigenvalues ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 ## Variance 0.141 0.121 0.104 0.098 0.096 0.081 0.073 ## % of var. 14.140 12.066 10.407 9.788 9.606 8.136 7.271 ## Cumulative % of var. 14.140 26.207 36.613 46.401 56.007 64.143 71.414 ## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 ## Variance 0.066 0.063 0.059 0.052 0.045 ## % of var. 6.626 6.280 5.905 5.237 4.537 ## Cumulative % of var. 78.040 84.321 90.226 95.463 100.000 ## ## Individuals (the 10 first) ## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ## 1 | -0.691 1.124 0.425 | -0.345 0.329 0.106 | -0.384 ## 2 | -0.307 0.222 0.081 | 0.155 0.066 0.020 | -0.565 ## 3 | -0.215 0.109 0.071 | -0.518 0.742 0.411 | -0.255 ## 4 | -0.123 0.036 0.015 | -0.284 0.223 0.079 | 0.539 ## 5 | -0.190 0.085 0.037 | -0.038 0.004 0.001 | 0.321 ## 6 | -0.608 0.872 0.357 | -0.578 0.922 0.322 | -0.036 ## 7 | -0.608 0.872 0.357 | -0.578 0.922 0.322 | -0.036 ## 8 | 0.048 0.005 0.004 | -0.448 0.554 0.347 | 0.161 ## 9 | -0.612 0.883 0.368 | -0.205 0.116 0.041 | 0.308 ## 10 | -0.215 0.109 0.071 | -0.518 0.742 0.411 | -0.255 ## ctr cos2 ## 1 0.473 0.131 | ## 2 1.023 0.274 | ## 3 0.209 0.100 | ## 4 0.931 0.285 | ## 5 0.329 0.105 | ## 6 0.004 0.001 | ## 7 0.004 0.001 | ## 8 0.083 0.045 | ## 9 0.303 0.093 | ## 10 0.209 0.100 | ## ## Categories (the 10 first) ## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 ## escape-exoticism | 0.214 1.283 0.041 3.515 | 0.394 5.087 0.140 ## Not.escape-exoticism | -0.193 1.153 0.041 -3.515 | -0.355 4.572 0.140 ## Not.spirituality | -0.236 2.253 0.122 -6.039 | -0.028 0.037 0.002 ## spirituality | 0.517 4.937 0.122 6.039 | 0.061 0.081 0.002 ## healthy | 0.314 4.064 0.230 8.290 | -0.350 5.915 0.285 ## Not.healthy | -0.732 9.483 0.230 -8.290 | 0.816 13.801 0.285 ## diuretic | 0.406 5.637 0.228 8.252 | 0.122 0.594 0.020 ## Not.diuretic | -0.561 7.785 0.228 -8.252 | -0.168 0.820 0.020 ## friendliness | 0.156 1.161 0.102 5.519 | -0.008 0.004 0.000 ## Not.friendliness | -0.652 4.842 0.102 -5.519 | 0.034 0.015 0.000 ## v.test Dim.3 ctr cos2 v.test ## escape-exoticism 6.466 | 0.426 6.871 0.163 6.979 | ## Not.escape-exoticism -6.466 | -0.383 6.175 0.163 -6.979 | ## Not.spirituality -0.712 | -0.236 3.072 0.122 -6.050 | ## spirituality 0.712 | 0.518 6.731 0.122 6.050 | ## healthy -9.239 | -0.200 2.243 0.093 -5.284 | ## Not.healthy 9.239 | 0.467 5.234 0.093 5.284 | ## diuretic 2.475 | -0.326 4.944 0.147 -6.630 | ## Not.diuretic -2.475 | 0.451 6.828 0.147 6.630 | ## friendliness -0.287 | -0.014 0.012 0.001 -0.485 | ## Not.friendliness 0.287 | 0.057 0.051 0.001 0.485 | ## ## Categorical variables (eta2) ## Dim.1 Dim.2 Dim.3 ## escape.exoticism | 0.041 0.140 0.163 | ## spirituality | 0.122 0.002 0.122 | ## healthy | 0.230 0.285 0.093 | ## diuretic | 0.228 0.020 0.147 | ## friendliness | 0.102 0.000 0.001 | ## iron.absorption | 0.068 0.008 0.011 | ## feminine | 0.310 0.009 0.007 | ## sophisticated | 0.215 0.030 0.090 | ## slimming | 0.275 0.055 0.003 | ## exciting | 0.030 0.258 0.127 | |
r # visualize MCA plot(mca_where, invisible=c("ind"), graph.type = "classic", habillage = "quali") |
r plot(mca_how, invisible=c("ind"), graph.type = "classic", habillage = "quali") |
r plot(mca_who, invisible=c("ind"), graph.type = "classic", habillage = "quali") |
r plot(mca_why, invisible=c("ind"), graph.type = "classic", habillage = "quali") |
| *** |
r date() |
## [1] "Mon Dec 12 11:05:08 2022" |
| # Assignment 6: Tasks and Instructions |
| ## Data wrangling (max 5 points) |
| See meet_and_repeat.R. |
| ## Analysis (max 15 points) |
| - Use the corresponding Exercise Set and the MABS4IODS materials - Note. As you read the data sets in your chapter6.Rmd you may have to factor the categorical variables again, as R may not recognise them automatically as factors. - Note. that you must SWAP the data sets! :) It is NOT a simple copy & paste of the MABS book! |
r #install.packages('ggplot2') library(dplyr) library(tidyr) library(ggplot2) |
| ## RATS data and Chapter8 (0-7 points) |
| Implement the analyses of Chapter 8 of MABS using the RATS data. |
| - 0-4 points for graphs or analysis results - 0-3 points for their interpretations |
| Download and convert the data to long form |
| ```r RATS3 <- read.table(“https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt”, header = TRUE, sep = ’) RATS3\(ID <- factor(RATS3\)ID) #change these to factors RATS3\(Group <- factor(RATS3\)Group) #change these to factors |
| RATSL3 <- pivot_longer(RATS3, cols = -c(ID, Group), names_to = “WD”, values_to = “Weight”) %>% mutate(Time3 = as.integer(substr(WD,3,4))) %>% arrange(Time3) #change it to long-form |
| glimpse(RATSL3) #sanity-check ``` |
## Rows: 176 ## Columns: 5 ## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,… ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, … ## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", … ## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470… ## $ Time3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, … |
| ### Summary RATS |
r names(RATSL3) #"ID" "Group" "WD" "Weight" "Time3" |
## [1] "ID" "Group" "WD" "Weight" "Time3" |
r str(RATSL3) #176 observations and 5 columns (variables) |
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame) ## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ... ## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ... ## $ WD : chr [1:176] "WD1" "WD1" "WD1" "WD1" ... ## $ Weight: int [1:176] 240 225 245 260 255 260 275 245 410 405 ... ## $ Time3 : int [1:176] 1 1 1 1 1 1 1 1 1 1 ... |
r summary(RATSL3) #ID and Group factors, WD chr and Weight and Time3 int |
## ID Group WD Weight Time3 ## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00 ## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00 ## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00 ## 4 : 11 Mean :384.5 Mean :33.55 ## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00 ## 6 : 11 Max. :628.0 Max. :64.00 ## (Other):110 |
| ### Individuals on the plot |
| Graphical displays of individual datapoint: Weight values over time between 3 groups. |
r library(dplyr); library(tidyr) ggplot(RATSL3, aes(x = Time3, y = Weight, linetype = ID)) + geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + facet_grid(. ~ Group, labeller = label_both) + theme(legend.position = "none") + scale_y_continuous(limits = c(min(RATSL3$Weight), max(RATSL3$Weight))) |
| Interpretation of the plot: Individual plot - First, in almost every group the weight is increasing over time (but in group 1 it remains quite stable). - Second, Group 2 and 3 have higher weight than group 1 - Third, there is not much change in weight within the groups. |
| ### The Golden Standardise - tracking |
| “The tracking phenomenon can be seen more clearly in a plot of the standardized values of each observation, i.e., the values obtained by subtracting the relevant occasion mean from the original observation and then dividing by the corresponding visit standard deviation.” (Exercise6.Rmd) |
| ```r library(dplyr); library(tidyr); library(ggplot2) |
| #standardize Weight RATSL3 <- RATSL3 %>% group_by(Time3) %>% mutate(stdWeight = Weight) %>% ungroup() glimpse(RATSL3) # sanity-check; stdWeight added ``` |
## Rows: 176 ## Columns: 6 ## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,… ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, … ## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1… ## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, … ## $ Time3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, … ## $ stdWeight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, … |
r # Plot again with the standardised Weight ggplot(RATSL3, aes(x = Time3, y = stdWeight, linetype = ID)) + geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + facet_grid(. ~ Group, labeller = label_both) + scale_y_continuous(name = "standardized Weight") |
| Interpretation of the plot - individual (standardized) |
| Since weight is often normally distributed it did not change our results. To double-check this we can use identical() command. |
r identical(RATSL3$Weight, RATSL3$stdWeight) #TRUE |
## [1] TRUE |
| ### Summary graphs |
| With large numbers of observations, graphical displays of individual response profiles are of little use and investigators then commonly produce graphs showing average (mean) profiles for each treatment group along with some indication of the variation of the observations at each time point, in this case the standard error of mean |
| ```r library(dplyr); library(tidyr); library(ggplot2) |
| RATSL3SS <- RATSL3 %>% group_by(Group, Time3) %>% summarise( mean = Weight, se = Weight ) %>% ungroup() ``` |
## `summarise()` has grouped output by 'Group', 'Time3'. You can override using ## the `.groups` argument. |
r glimpse(RATSL3SS) #sanity-check, means and se added |
## Rows: 176 ## Columns: 4 ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ Time3 <int> 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 15, 15, 15, 15, … ## $ mean <int> 240, 225, 245, 260, 255, 260, 275, 245, 250, 230, 250, 255, 260,… ## $ se <int> 240, 225, 245, 260, 255, 260, 275, 245, 250, 230, 250, 255, 260,… |
r ggplot(RATSL3SS, aes(x = Time3, y = mean, linetype = Group, shape = Group)) + geom_line() + scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) + scale_shape_manual(values = c(1,2,3)) + #geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) + theme(legend.position = c(0.9,0.4)) + #change the legend poistion scale_y_continuous(name = "mean(Weight) +/- se(Weight)") |
r ggplot(RATSL3SS, aes(x = Time3, y = mean, linetype = Group, shape = Group)) + geom_line() + scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) + scale_shape_manual(values = c(1,2,3)) + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) + theme(legend.position = c(0.9,0.8)) + #change the legend poistion scale_y_continuous(name = "mean(Weight) +/- se(Weight)") |
| Interpretation of the plot - summary graphs |
| These two plots are identical, the last one just have error bars added. Here you can see that |
| - Group3 (circle) has the lowest average weight in comaprison to group2 and group3, and that the weight stays quite stabel over the time period. - Group2 (triangle) has similar weight pattern to group3 (cross). The weight seem to go up and down based on different time points. - The change in weight seem to be greater for group2 than group3 - when counting the error bars it seem that there is quite a lot of variability within the groups. |
| ### Outliers |
| Create a summary data by treatment and subject with mean as the summary variable and draw a boxplot to investigate the outliers and delete them, if any. |
| ```r library(dplyr); library(tidyr); library(ggplot2) |
| RATSL3S <- RATSL3 %>% group_by(Group, ID) %>% summarise( mean=mean(Weight) ) %>% ungroup() ``` |
## `summarise()` has grouped output by 'Group'. You can override using the ## `.groups` argument. |
r glimpse(RATSL3S) #sanity-check |
## Rows: 16 ## Columns: 3 ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3 ## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.… |
r # Draw a boxplot of the mean versus Group ggplot(RATSL3S, aes(x = Group, y = mean)) + geom_boxplot() + stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") + scale_y_continuous(name = "mean(Weight), Time 1-64") + ylim(200, 600) |
## Scale for 'y' is already present. Adding another scale for 'y', which will ## replace the existing scale. |
| Interpretation of the plot - outliers |
| It seems that each group has an outlier. To get rid of them I found this useful link. See case_when |
| - Group1 has an outlier mean weight (min 1 = 237.6) - Group2 has an outlier mean weight (max 2 = 590.5) - Group3 has an outlier mean weight (min 3 = 492.9) - See tapply commands below |
r RATSL3S <- RATSL3 %>% group_by(Group, ID) %>% summarise( mean=mean(Weight) ) %>% ungroup() |
## `summarise()` has grouped output by 'Group'. You can override using the ## `.groups` argument. |
r glimpse(RATSL3S) #sanity-check |
## Rows: 16 ## Columns: 3 ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3 ## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.… |
r tapply(RATSL3S$mean, RATSL3S$Group, min) # 1=237.6, 2=440.8, 3=492.9 |
## 1 2 3 ## 237.6364 440.8182 492.9091 |
r tapply(RATSL3S$mean, RATSL3S$Group, max) # 1=274.7, 2=590.5, 3=540.2 |
## 1 2 3 ## 274.7273 590.4545 540.1818 |
| ```r RATSL3S_filter = RATSL3S %>% filter(case_when(Group==“1”~ mean>238, Group==“2”~ mean<590, Group==“3”~ mean>493)) |
| ggplot(RATSL3S_filter, aes(x = Group, y = mean)) + geom_boxplot() + stat_summary(fun = “mean”, geom = “point”, shape=23, size=4, fill = “white”) + scale_y_continuous(name = “mean(Weight), Time 1-64”) + ylim(200, 600) ``` |
## Scale for 'y' is already present. Adding another scale for 'y', which will ## replace the existing scale. |
| Interpretation of the plot - outliers2 |
| Now the outliers are deleted. Also, the shapes of the boxplots have changed, since we deleted the outliers that also impacted on the quartilies and means. Group 1 has the lowest average weight, followed by Group2 and Group3. Below you can see the comparison between the datasets. |
r tapply(RATSL3S$mean, RATSL3S$Group, summary) |
## $`1` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 237.6 260.9 266.0 263.7 270.8 274.7 ## ## $`2` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 440.8 449.8 453.8 484.7 488.7 590.5 ## ## $`3` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 492.9 523.6 535.0 525.8 537.2 540.2 |
r tapply(RATSL3S_filter$mean, RATSL3S_filter$Group, summary) |
## $`1` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 260.2 263.3 266.5 267.4 272.0 274.7 ## ## $`2` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 440.8 446.8 452.7 449.5 453.8 454.8 ## ## $`3` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 533.8 535.0 536.3 536.8 538.2 540.2 |
| ### T-test and Anova |
| We use data without outliers (RATSL3S_filter). The t-test is used when you compare means of 2 groups and ANOVA two or more groups. Since RATS has 3 groups, we can a) conduct ANOVA, or b) choose two groups that we compare. |
| ANOVA |
r #one way oneway.test(mean ~ Group, data = RATSL3S_filter, var.equal = TRUE) |
## ## One-way analysis of means ## ## data: mean and Group ## F = 2577.4, num df = 2, denom df = 10, p-value = 2.721e-14 |
r #another way res_aov <- aov(mean ~ Group, data = RATSL3S_filter) summary(res_aov) |
## Df Sum Sq Mean Sq F value Pr(>F) ## Group 2 175958 87979 2577 2.72e-14 *** ## Residuals 10 341 34 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
| ANOVAS p-value is 2.721e-14, which is p<.05, meaning there is difference in mean weight between the groups. Next, we need to run Post-Hoc tests, to see where excatly the difference is located. |
| T-test can be used as a Post-Hoc test (all tehe other tests have similar idea). But to overcome the problem of multiple comparison, we need to multiply the p-value based on the number of comparisons to see if it still stands. Other option is to divide our chosen p-value (p<.05) by the number of comparison to see what the true acceptable p-value would be (0.05/3=0.01666667). So, the acceptable p-value instead of p<.05 it actually is p<.02. |
| For conducting t-tests, again, I got some help from Stackoverflow |
| Two-sample t-test |
r t.test(mean ~ Group, data=RATSL3S_filter[RATSL3S_filter$Group %in% c(1,2),]) #Group 1 and 2 |
## ## Welch Two Sample t-test ## ## data: mean by Group ## t = -37.206, df = 3.1002, p-value = 3.259e-05 ## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0 ## 95 percent confidence interval: ## -197.3009 -166.7251 ## sample estimates: ## mean in group 1 mean in group 2 ## 267.4416 449.4545 |
r t.test(mean ~ Group, data=RATSL3S_filter[RATSL3S_filter$Group %in% c(1,3),]) #Group 1 and 3 |
## ## Welch Two Sample t-test ## ## data: mean by Group ## t = -93.176, df = 7.0287, p-value = 3.95e-12 ## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0 ## 95 percent confidence interval: ## -276.1451 -262.4869 ## sample estimates: ## mean in group 1 mean in group 3 ## 267.4416 536.7576 |
r t.test(mean ~ Group, data=RATSL3S_filter[RATSL3S_filter$Group %in% c(2,3),]) #Group 2 and 3 |
## ## Welch Two Sample t-test ## ## data: mean by Group ## t = -18.428, df = 2.6996, p-value = 0.0006444 ## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0 ## 95 percent confidence interval: ## -103.37511 -71.23095 ## sample estimates: ## mean in group 2 mean in group 3 ## 449.4545 536.7576 |
| There are statistically significant (p<.05) difference between |
| - Group 1 and 2, p-value = 3.259e-05 = 3.259e-053 = 9.777e-05 - Group 1 and 3, p-value = 3.95e-12 = 3.95e-123 = 1.185e-11 - Group 2 and 3, p-value = 0.0006444 = 0.0006444*3 = 0.0019332 |
| Since the p-values for the current comparisons are so low, even multiplying them with number of comparison (n=3), they stay statistically significant. |
| ** Fit the linear model** |
| ```r library(dplyr); library(tidyr) |
| # Add the baseline from the original data as a new variable to the summary data RATSL3S_2 <- RATSL3S %>% #use the data without filters, otherwise the dimensions does nto match mutate(baseline = RATS3$WD1) #use WD1 as a baseline glimpse(RATSL3S_2) #sanity-check baseline has been added to the dataframe ``` |
## Rows: 16 ## Columns: 4 ## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3 ## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2… ## $ baseline <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4… |
r fit <- lm(mean~Group+baseline, data = RATSL3S_2) # Fit the linear model with the mean as the response and group and WD1 (baseline) as indicators anova(fit) # Compute the analysis of variance table for the fitted model with anova() |
## Analysis of Variance Table ## ## Response: mean ## Df Sum Sq Mean Sq F value Pr(>F) ## Group 2 236732 118366 1050.24 3.360e-14 *** ## baseline 1 16119 16119 143.02 5.023e-08 *** ## Residuals 12 1352 113 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
| Both group and baseline are significant in this model (p<.001). Meaning that there are significant difference between the groups at WD1 in weight. So, even at the start of the testing, groups differed based on weight. |
Implement the analyses of Chapter 9 of MABS using the BPRS data. (0-8 points)
Download and convert the data to long form
BPRS3 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS3$treatment <- factor(BPRS3$treatment)
BPRS3$subject <- factor(BPRS3$subject)
BPRSL3 <- pivot_longer(BPRS3, cols = -c(treatment, subject),
names_to = "weeks", values_to = "bprs3") %>%
arrange(weeks)
BPRSL3 <- BPRSL3 %>%
mutate(week = as.integer(substr(weeks,5,5)))
glimpse(BPRS3)
## Rows: 40
## Columns: 11
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1 <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2 <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3 <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4 <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5 <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6 <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7 <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8 <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
glimpse(BPRSL3)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs3 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Same as above, but standardized
BPRSL standardized
BPRSL3_std <- BPRSL3 %>%
group_by(week) %>%
mutate(stdbprs = (bprs3 - mean(bprs3))/sd(bprs3) ) %>%
ungroup()
glimpse(BPRSL3_std) #sanity check, stdbprs
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs3 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.6983632, 0.00…
View(BPRSL3_std)
Lets make a plot where we compare subjects in treatment 1 and 2. I could figure out, how to draw similar plot than in the MABS4IODS.pdf, Figure 9.2, page 177.
I think we should have re-name the subjects based on the groups, since now both treatment 1 and 2 have subject 1 etc., that makes it hard to draw them into same plot. Maybe if treatment 1 would have subjects from 1 to 20 and treatment 2 21-40, it might work.
Plot 1
dim(BPRSL3_std) #360 rows (observations), 6 columns (variables)
## [1] 360 6
# unstandardized
ggplot(BPRSL3_std, aes(x = week, y = bprs3, group = subject, colour=subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
# standardized
ggplot(BPRSL3_std, aes(x = week, y = stdbprs, group = subject, colour=subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
Interpretation of Plot 1
Create a multiple linear regression model with
BPRSL3_reg <- lm(BPRSL3_std$bprs3 ~ BPRSL3_std$week + BPRSL3_std$treatment) #stdbprs as dependent and week and treatment independent variables.
summary(BPRSL3_reg) # summary of the model
##
## Call:
## lm(formula = BPRSL3_std$bprs3 ~ BPRSL3_std$week + BPRSL3_std$treatment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## BPRSL3_std$week -2.2704 0.2524 -8.995 <2e-16 ***
## BPRSL3_std$treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Interpretation of the model
When using un-standardized bprs values we see that
The previous model we assumed independence of the repeated measures of bprs, which is very unlikely. Next we will consider that the relationship is dependent (within-subject).
First we will fit the random intercept model for the same
two explanatory variables: week and treatment.
Fitting a random intercept model allows the linear regression fit for
each subject to differ in intercept from other subjects
(within-subject desing). In otherword, we are interested to investigate
how the bprs values change within subjects, and id treatment groups and
time variables are playing role in individual changes.
We use lme4 package which offers
efficient tools for fitting linear and generalized linear mixed-effects
models.
The first argument is the formula object describing both
the fixed-effects and random effects part of the model, with the
response on the left of a ~ operator and the terms, separated by +
operators, on the right. Note the random-effects terms distinguished by
vertical bars (|). Note: You should first install the
package lme4.
#install.packages("lme4")
library(lme4)
## Warning: package 'lme4' was built under R version 4.2.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRSL3_std_ref <- lmer(bprs3 ~ week + treatment + (1 | subject), data = BPRSL3_std, REML = FALSE)
summary(BPRSL3_std_ref) # Print the summary of the model
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs3 ~ week + treatment + (1 | subject)
## Data: BPRSL3_std
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Interpretation of the model
Interpretation of t-values see link
The results seem to be very similar when taking into account the random intercept model. Based on the model’s estimates and t-values we can conclude:
Next, fit the random intercept and random slope model to the bprs data. “Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences, but also the effect of time.” (Exercise6.Rmd)
BPRSL3_std_ref2 <- lmer(bprs3 ~ week + treatment + (week | subject), data = BPRSL3_std, REML = FALSE)
summary(BPRSL3_std_ref2) # Print the summary of the model
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs3 ~ week + treatment + (week | subject)
## Data: BPRSL3_std
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
Interpretation of the model
Use ANOVA to compare these two models:
BPRSL3_std__ref and BPRSL3_std__ref2anova(BPRSL3_std_ref, BPRSL3_std_ref2)
## Data: BPRSL3_std
## Models:
## BPRSL3_std_ref: bprs3 ~ week + treatment + (1 | subject)
## BPRSL3_std_ref2: bprs3 ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL3_std_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRSL3_std_ref2 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of ANOVA
Last, fit a random intercept and slope model that allows for a group × time interaction.
BPRSL3_std_ref3 <- lmer(bprs3 ~ week + treatment + week*treatment + (week | subject), data = BPRSL3_std, REML = FALSE)
summary(BPRSL3_std_ref3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs3 ~ week + treatment + week * treatment + (week | subject)
## Data: BPRSL3_std
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
Interpretation of the model
ANOVA and plot
perform an ANOVA test on the two previous models (with and without interaction) and draw a plot.
anova(BPRSL3_std_ref2, BPRSL3_std_ref3)
## Data: BPRSL3_std
## Models:
## BPRSL3_std_ref2: bprs3 ~ week + treatment + (week | subject)
## BPRSL3_std_ref3: bprs3 ~ week + treatment + week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL3_std_ref2 7 2745.4 2772.6 -1365.7 2731.4
## BPRSL3_std_ref3 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of ANOVA
Finally, we will draw a plot based on observed BPRSL3 values and fitted values.
fitted()mutate() to add the vector
Fitted as a new column to RATSLHints: - Create a vector of the fitted values of the model using the
function fitted(). Supply it with the model
RATS_ref2 - Use mutate() to add the vector
Fitted as a new column to RATSL.
Fitted <- fitted(BPRSL3_std_ref2) # Create a vector of the fitted values
BPRSL3$Fitted <- Fitted # add Fitted to BPRSL3 data frame
glimpse(BPRSL3) #sanity check
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs3 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Fitted <dbl> 47.84592, 45.52504, 46.20831, 48.42481, 64.89317, 41.16923, …
Plots
ggplot(BPRSL3, aes(x = week, y = bprs3, group = subject, colour=subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
# draw the plot of BPRSL3 with the Fitted values of weight
ggplot(BPRSL3, aes(x = week, y = Fitted, group = subject, colour=subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks") +
scale_y_continuous(name = "bprs fitted") +
theme(legend.position = "top")
Interpretation of plots
End of Assignment 6.
Thank you for the course!